I. Introduction

The  purpose  of  this  Department  of  Defense  Breast  Cancer  Research  Program  Career  Development  Award 
is  enabling  Dr.  Rutter  to  develop  biostatistical  methods  for  breast  cancer  research.  Dr.  Rutter’s  focus  is  on 
evaluating the accuracy of breast cancer screening. This four-year program includes advanced training in
the  epidemiology  of  breast  cancer,  training  in  clinical  detection  of  breast  cancer,  development  of  statistical 
methodology,  and  graduate  teaching.  A  basic  knowledge  of  the  epidemiology,  disease  process  and 
detection  of  breast  cancer  will  guide  the  development  of  statistical  methods  designed  to  address  analysis 
problems  encountered  when  evaluating  mammography.  Proposed  statistical  research  focuses  on  receiver 
operating  characteristic  (ROC)  analysis.  ROC  analysis  provides  accuracy  measures  for  ordinal  tests  and  is 
a  more  general  analysis  strategy  than  other  methods  devised  for  dichotomous  test  outcomes.  Therefore, 
the  proposed  research  will  have  implications  for  both  ordinal  scale  and  dichotomous  test  analyses. 
Additional  research  will  explore  accuracy  measures  specific  to  dichotomous  test  outcomes,  including 
sensitivity,  specificity,  and  positive  and  negative  predictive  values. 

II. Year 2 Achievements

Technical Objective 1: Gain additional training in breast cancer epidemiology, detection and treatment
During  this  second  award  year,  Dr.  Rutter  has  continued  to  attend  Breast  Cancer  Surveillance  Consortium 
(BCSC)  meetings  [1],  and  has  presented  her  research  to  the  BCSC’s  Statistical  Coordinating  Center.  The 
BCSC  is  a  multi-site  NCI-funded  study  evaluating  the  performance  of  mammography  in  a  community 
setting.  Ongoing  participation  in  BCSC  meetings  has  provided  Dr.  Rutter  with  important  practical 
information  about  radiologists’  interpretation  of  mammograms,  and  the  timing  and  execution  of 
diagnostic  procedures. 

Statistical  Research  Aims 

Dr. Rutter has continued to expand her knowledge of statistical methods for diagnostic and screening test
assessment. In particular, participation in the Diagnostic Methods working group at the University of
Washington’s Department of Biostatistics has clarified her thinking.

Technical Objective 2: Statistical Research, Aim 1: Develop methods for multiple patient assessments.

Dr. Rutter has continued to work toward publishing her article describing bootstrap approaches for
multi-site, multi-reader diagnostic test data. A brief version of this article was rejected by Biometrics. A more
complete version, which includes percentile intervals and comparisons to an analytic estimator, has been
submitted to Academic Radiology (see Appendix B). The analysis used for this bootstrap paper can be
conducted using a relatively simple SAS macro. During her third funding year, Dr. Rutter will generalize
this macro and examine ways to make it more broadly available, for example via the SAS users group
webpage.

Dr.  Rutter  has  no  plans  for  further  development  of  this  research  pathway.  This  reflects  her  increased 
knowledge  of  data  collection  and  use  by  radiologists,  and  also  reflects  recent  developments  in  statistical 
methodology.  As  described  in  her  year  one  progress  report,  generalized  estimating  approaches  for 
diagnostic  data  have  been  fully  developed  [2,3].  These  models  can  accommodate  correlated  rating  data, 
and  estimation  of  models  can  be  carried  out  using  standard  statistical  software  packages.  In  addition,  the 
limited  robustness  of  ROC  curve  analysis  [4]  limits  the  usefulness  of  robust  covariance  adjustments. 


More  importantly,  multi-site  data  are  not  likely  to  be  used  when  assessing  screening  mammography.  In 
the  screening  setting,  laterality  is  important  but  quadrant  location  within  the  breast  is  less  important 
because  women  go  on  to  further  diagnostic  assessment.  Because  screening  assessments  affect  the  entire 
woman,  rather  than  the  breast,  statistics  that  deal  with  data  at  the  woman-level  are  most  informative.  The 
best approach to these data combines breast-level ratings with disease state. When a
woman  has  bilateral  disease,  the  lowest  (least  likelihood  of  disease)  rating  given  to  her  two  breasts  is  used 
for  analyses,  capturing  potential  undercalling  of  disease.  When  a  woman  has  unilateral  disease,  the  rating 
given  to  her  diseased  breast  is  used  for  analyses.  Although  this  ignores  overcalling  in  the  non-diseased 
breast,  it  incorporates  critical  disease  detection  information.  When  a  woman  does  not  have  disease,  the 
maximum  (most  likelihood  of  disease)  rating  is  used,  capturing  overcalling  of  disease. 
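
The woman-level rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the (rating, diseased) tuple layout, and the convention that higher ratings mean greater suspicion of disease are assumptions made for the example.

```python
from typing import Tuple

Breast = Tuple[int, bool]  # (ordinal rating, True if this breast is diseased)


def woman_level_rating(left: Breast, right: Breast) -> Tuple[int, bool]:
    """Collapse two breast-level ratings into one woman-level (rating, diseased) pair.

    Assumes higher ratings indicate a higher likelihood of disease.
    """
    ratings = [left[0], right[0]]
    diseased = [left[1], right[1]]
    if all(diseased):
        # Bilateral disease: use the lowest rating to capture undercalling.
        return min(ratings), True
    if any(diseased):
        # Unilateral disease: use the rating given to the diseased breast.
        return (left[0] if left[1] else right[0]), True
    # No disease: use the highest rating to capture overcalling.
    return max(ratings), False
```

For example, a woman with unilateral disease rated 1 in the diseased breast and 5 in the other contributes a woman-level rating of 1, so the missed cancer is counted and the overcall in the healthy breast is ignored.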

In the diagnostic setting, both laterality and quadrant are important. These data can be analyzed using GEE
methods  [2,3].  Unfortunately,  rating  data  are  generally  not  collected  at  the  quadrant  level.  Analysis  at  the 
quadrant  level  can  also  be  limited  by  the  accuracy  of  localization  by  the  gold  standard  (e.g.,  pathology 
reports  or  cancer  registry  outcomes). 

Technical Objective 3: Statistical Research, Aim 2: Extend exact methods for ordinal regression models
Development  of  methods  for  small  samples  has  been  deferred  to  year  3.  Instead,  Dr.  Rutter  has  focused 
on  methods  for  adjusting  for  measurement  error  in  disease  status. 


Technical Objective 4: Statistical Research, Aim 3: Develop methods to adjust for measurement error in
disease status

Several  authors  have  explored  methods  for  estimating  test  accuracy  when  there  are  multiple  test  outcomes 
with no true gold standard [5-11]. Some articles have described methods that allow estimation of accuracy
in the complete absence of gold standard information [10,11]. Over the last year, this topic has been of
great  interest  to  the  Diagnostic  Methods  working  group.  Methods  that  handle  missing  disease  status  rely 
on  latent  variable  approaches  when  the  ‘definitive’  diagnosis  is  uncertain.  These  solutions  are  not 
satisfying  because  they  hinge  on  a  latent,  unobserved,  disease  state.  In  this  case,  the  referent  populations 
are  unknown,  making  comparisons  across  studies,  or  from  studies  to  clinical  practice,  extremely  difficult. 
Consider  a  situation  where  misclassification  can  be  extreme:  screening  for  alcohol  abuse  and  dependence. 
Suppose  there  was  a  new  blood  test  for  alcohol  abuse  and  dependence.  To  test  the  accuracy  of  this  new 
blood test, the natural reference standard is the Diagnostic and Statistical Manual (DSM) definition of
alcohol  abuse  and  dependence  [12],  assessed  using  a  questionnaire.  This  reference  standard  is  likely  to 
misdiagnose  some  people.  However,  one  of  the  key  purposes  of  the  DSM  diagnostic  guidelines  is 
standardization  that  allows  comparability  of  independent  research.  New  statistical  methods  allow 
estimation of sensitivity and specificity relative to an unobserved true state. Unfortunately, these
approaches  leave  the  research  community  to  ponder  the  meaning  of  these  sensitivities  and  specificities. 
Exactly  what  the  test  detects  is  unclear  because  it  is  essentially  undefined,  rendering  these  estimates 
useless. 

An  alternative  approach  is  to  clearly  define  the  reference  standard,  and  to  improve  reference  standards  as 
necessary.  In  the  context  of  screening  mammography,  the  accepted  standard  is  biopsy  and  two  years  of 
follow-up data. Currently, a cancer is “missed” by screening if it occurs within two years of a
disease-negative screening. Incorporation of stage-of-disease information could improve this gold standard. In
this case, a cancer is “missed” by screening if it occurs within two years of a disease-negative screening and the
woman  is  node  positive.  Other  information,  such  as  tumor  size  or  grade,  could  be  incorporated  into  this 
definition.  This  approach  acknowledges  the  goal  of  screening  mammography:  early  detection  of  disease. 
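
The two definitions of a “missed” cancer can be contrasted in a short sketch. The function and parameter names are hypothetical, and treating “within two years” as 24 months or less is an assumption made for illustration.

```python
from typing import Optional


def missed_by_screening(months_to_diagnosis: Optional[float],
                        node_positive: bool,
                        use_stage: bool = False,
                        window_months: float = 24.0) -> bool:
    """Classify a disease-negative screen as a missed cancer.

    months_to_diagnosis is None when no cancer was diagnosed during follow-up.
    With use_stage=False, any cancer inside the window counts as missed.
    With use_stage=True, the cancer must also be node positive.
    """
    if months_to_diagnosis is None:
        return False
    in_window = months_to_diagnosis <= window_months
    if not use_stage:
        return in_window
    return in_window and node_positive
```

Under the stage-based definition, a node-negative cancer found ten months after a negative screen is no longer counted against the screening test.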


Technical  Objective  5:  Develop  and  teach  a  course  in  methods  for  assessing  diagnostic  tests. 

As  the  1999  Genentech  Distinguished  Professor,  Dr.  Margaret  Pepe  will  teach  a  special  topics  course  on 
Medical  Diagnostic  Testing  at  the  University  of  Washington’s  Department  of  Biostatistics  in  Spring  2000. 
Dr. Rutter will work with Dr. Pepe to develop course materials and lectures, and will guest lecture.

Progress Toward Other Grant Aims

Dr.  Rutter  is  working  toward  publishing  her  article  (co-authored  with  Constantine  Gatsonis)  describing 
models  for  meta-analysis  of  diagnostic  test  data  (see  Appendix  C).  These  methods  were  developed  as  a 
way  to  appropriately  combine  sensitivities  and  specificities  from  several  studies,  but  can  also  be  applied  to 
sensitivity/specificity  results  from  several  sites  within  a  multi-site  study  or  from  several  mammography 
centers  within  a  study.  This  article  was  recently  rejected  by  Statistics  in  Medicine,  and  Dr.  Rutter  is  in  the 
process  of  revising  the  article  for  resubmission  to  an  alternate  journal. 

Dr.  Rutter  has  examined  the  correlation  between  test  and  clinical  performance  measures  of 
mammographic  interpretation  (see  Appendix  D).  This  article  was  submitted  to  Journal  of  Clinical 
Epidemiology  and  is  in  the  process  of  a  second  review  following  an  encouraging  “revise  and  resubmit”. 
Direct  estimation  of  mammographers'  clinical  accuracy  requires  the  ability  to  capture  screening 
assessments  and  correctly  identify  which  screened  women  have  breast  cancer.  This  clinical  information  is 
often  unavailable  and  when  it  is  available  its  observational  nature  can  cause  analytic  problems.  Problems 
with  clinical  data  have  led  some  researchers  to  evaluate  mammographers  using  a  single  set  of  films. 
Research  based  on  these  test  film  sets  implicitly  assumes  a  correspondence  between  mammographers' 
accuracy  in  the  test  setting  and  their  accuracy  in  a  clinical  setting.  However,  there  is  no  evidence 
supporting  this  basic  assumption.  Dr.  Rutter  used  hierarchical  models  and  data  from  27  mammographers 
to  directly  compare  accuracy  estimated  from  clinical  practice  data  to  accuracy  estimated  from  a  test  film 
set.  There  was  no  evidence  of  correlation  between  clinical  and  test  accuracy.  These  findings  raise 
important  questions  about  how  mammographer  accuracy  should  be  measured.  Dr.  Rutter  has  presented 
these  findings  to  the  BCSC,  and  plans  to  present  to  a  wider  audience  at  the  International  Conference  on 
Health  Policy  Statistics:  Methodologic  Issues  in  Health  Services  and  Outcomes  Research  in  December 
1999. 

During  her  year  two  funding  period,  Dr.  Rutter  also  coauthored  a  paper  describing  the  design  of  the 
mammography rereading studies [13] (see Appendix E).

In the last two years of grant funding, Dr. Rutter plans to shift her research goals somewhat, to better align
them  with  current  needs  in  mammography  research.  One  new  goal  is  development  of  methods  that  handle 
data collected using the Breast Imaging Reporting and Data System (BI-RADS) [14]. This standardized
lexicon of mammographic interpretations, prescribed by the American College of Radiology, improves
data collection by virtue of standardization. However, the inclusion of an interpretive code for additional
work-up complicates evaluation of mammographic accuracy. The additional work-up category does not
fit neatly into an ordinal outcome scale. These cases include a mix of women: for example, the category could
naturally include both women with suspected cysts (benign disease) and women with suspicious findings
that need additional evaluation. Models need to be developed to handle these kinds of data. One possible
approach  to  these  data  is  extension  of  two-part  models  employed  in  econometrics. [15]  The  first  part  of 
the  model  would  estimate  the  probability  of  an  interpretation  based  on  the  current  mammogram  (i.e., 
additional  workup  not  requested).  The  second  part  of  the  model  would  describe  ordinal  outcomes  among 
observations  with  an  interpretation  of  the  current  mammogram.  Inference  is  drawn  from  the  combined 
results  from  these  two  model  steps. 
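
One way such a two-part model might be written down is sketched below. The symbols are illustrative assumptions and are not taken from this report or from [15]: W_i = 1 indicates that additional work-up was requested, Y_i is the ordinal interpretation on a K-point scale when W_i = 0, D_i is true disease status, and both parts use a logit link purely for concreteness.

```latex
\begin{align}
% Part 1: probability that an interpretation is given on the current mammogram
\operatorname{logit} P(W_i = 0 \mid D_i) &= \alpha_0 + \alpha_1 D_i \\
% Part 2: ordinal rating among mammograms that received an interpretation
\operatorname{logit} P(Y_i \le k \mid W_i = 0, D_i) &= \theta_k - \beta D_i,
  \qquad k = 1, \ldots, K - 1 \\
% Combined: joint probability used for inference
P(W_i = 0, Y_i = k \mid D_i) &= P(W_i = 0 \mid D_i)\, P(Y_i = k \mid W_i = 0, D_i)
\end{align}
```

In this sketch, alpha_1 describes how disease status changes the chance of a definitive interpretation, and beta describes how it shifts the ordinal ratings among interpreted mammograms.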

III. Summary

Dr. Rutter remains on track, making significant progress toward her proposed research
goals. At this point in her career development award, she has shifted some of her proposed research in
response  to  her  increased  knowledge  of  breast  cancer  screening  and  advances  in  statistical  methodology. 
Statistical  approaches  for  dealing  with  errors  in  the  gold  standard  no  longer  seem  feasible.  Instead,  the 
focus  of  research  should  be  on  development  of  more  adequate  reference  standards.  New  research 
problems have come to the fore. Dr. Rutter has addressed the validity of assessing mammographers’
accuracy using test film sets and, in the future, plans to address analytic problems related to the BI-RADS
data  collection  system. 


Key  Research  Accomplishments,  Year  2: 

•  Published  article:  Pepe  MS,  Urban  N,  Rutter  C,  Longton  G  “Design  of  a  study  to  improve  accuracy  in 
reading  mammograms,”  Journal  of  Clinical  Epidemiology,  50:  1327-38, 1997. 

•  Submitted  article:  Rutter  CM.  Bootstrap  estimation  of  diagnostic  accuracy  using  patient-clustered 
data,  Academic  Radiology. 

•  Submitted  article:  Rutter  CM,  Taplin  S.  Assessing  mammographers’  accuracy:  A  comparison  of 
clinical  and  test  performance.  Journal  of  Clinical  Epidemiology. 

•  Developed  a  clear  plan  for  teaching  and  developing  a  course  on  Medical  Diagnostic  Testing  at 
University  of  Washington  with  Dr.  Margaret  Pepe. 

•  Ongoing  participation  in  Diagnostic  Methods  working  group  at  the  University  of  Washington’s 
Department  of  Biostatistics 

•  Ongoing  participation  in  Breast  Cancer  Surveillance  Consortium  meetings 

Appendix A. Statement of Work

Technical  Objective  1:  Gain  additional  training  in  breast  cancer  epidemiology,  detection  and 
treatment. 

Task  1:  Months  1-4:  Review  of  information  on  the  epidemiology,  diagnosis  and  treatment  of  breast 
cancer  as  suggested  by  Dr.  Margaret  Mandelson. 

Task  2:  Months  1-48:  Attend  seminars  sponsored  by  the  Seattle  Breast  Cancer  Research  Program. 

Technical  Objective  2:  Statistical  research,  aim  1:  develop  methods  for  multiple  patient 
assessments. 

Task 3: Month 6: Review current research for generalized estimating equation and random effect
approaches  for  nonlinear  models. 

Task  4:  Months  -11:  Test  bootstrap,  robust  covariance  adjustment  and  generalized  estimating  equation 
methods  for  breast-level  analyses  using  simulation  studies. 

Task  5:  Months  12-21:  Develop  methods  for  woman-level  analysis,  possibly  including  software 
development  for  random  effects  in  generalized  ordinal  regression  models. 

Technical  Objective  3:  Statistical  research,  aim  2:  extend  exact  methods  for  ordinal  regression 
models 

Task  6:  Month  22:  Review  current  research  in  exact  methods. 

Task  7:  Months  23-34:  Extend  exact  methods  and  write  computational  algorithms  and  programs  to 
compute  distributions  of  sufficient  statistics. 

Technical  Objective  4:  Statistical  research,  aim  3:  Develop  methods  to  adjust  for  measurement 
error  in  disease  status 

Task  8:  Month  36:  Review  current  research  in  errors-in-measurement  models. 

Task  9:  Months  37-48:  Develop  simple  combined  corrections  for  verification  and  follow-up  bias.  These 
methods  will  be  extended  to  allow  adjustments  in  general  ordinal  regression  models. 

Technical  Objective  5:  Develop  and  teach  a  course  in  methods  for  assessing  diagnostic  tests. 

Task 10: Months 1-24: Collect relevant references and outline lectures for the methods course. During
this  time,  specific  lectures  may  be  presented  in  other  University  of  Washington  courses. 

Task  11:  Months  25-36:  Offer  methods  course  at  University  of  Washington  through  the  Department  of 
Biostatistics. 

Bootstrap Estimation of Diagnostic Accuracy using Patient-clustered Data


Abstract 

Rationale and Objectives: This article describes simple, asymptotically consistent bootstrap
estimation of test accuracy statistics. Unlike most other methods, the bootstrap approach can
account for correlation due to multiple diagnostic modalities, multiple readers, and assessment
at multiple body sites. Bootstrap methods are easy to apply, even in complicated settings.

Methods: The performance of bootstrap estimates is evaluated and compared to analytic
estimates using a simulation study. Bootstrapping is demonstrated using data from a study comparing
two angiography methods.

Results: Analytic and bootstrap estimators had similar coverage rates. Bootstrap estimates were
slightly better in some cases, and analytic estimators were slightly better in others. Bootstrap
percentile intervals had better coverage than asymptotic normal bootstrap intervals.

Conclusions: Bootstrapping is a useful method for estimating confidence intervals for the area
under the ROC curve, sensitivity and specificity when data are correlated.

Keywords:  area  under  the  receiver  operating  characteristic  curve  (AUC),  sensitivity,  specificity. 

1. Introduction


Diagnostic evaluation often requires simultaneously assessing disease at multiple body sites.
Examples of multi-site diagnostic assessments include screening mammography to detect breast
cancer, computed tomography of the liver to detect metastatic colorectal cancer (Zerhouni et al,
1996), and magnetic resonance angiography of leg vessels to detect occlusive peripheral vascular
disease (Baum et al, 1995). Although the accuracy of these multi-site tests can be estimated
using information from a single body site, studies that use all available information have more
statistical power. Reducing site-level data to patient-level data is the simplest approach to multi-site
diagnostic assessment. However, composite patient-level measures of true state and test outcome
reduce the amount of information about test accuracy contained in multi-site assessments. These
composite measures also ignore disease localization, information that can be more important for
treatment decisions than global determination of disease presence.

Estimates of diagnostic accuracy that use multi-site data need to account for within-patient
correlation. Methods for handling multiple assessments of a single site, by different modalities
or readers, are well developed. Song (1997) gives an overview of current approaches. These
methods require that patients are either diseased or not diseased, and do not allow true state to
vary across the different sites within patients.

When  disease  state  is  dichotomous,  logistic  regression  models  can  be  used  to  estimate  the 
relationship  between  true  state  and  test  outcomes  (e.g.,  Baum  et  al,  1995).  When  data  are 
clustered within patients, standard methods can be used to adjust the logistic regression coefficient
covariance matrix for within-patient correlation (Lipsitz and Harrington, 1990). The logistic model
conditions on test results and estimates their association with disease state. These models do
not result in standard accuracy measures, making comparisons to other studies difficult.

Obuchowski described a method for estimating standard errors for the area under the
empirical receiver operating characteristic curve based on sums of squares (Obuchowski, 1997).
Obuchowski’s method allows estimation of the standard error of the AUC for a single test, or the
standard error of the difference between AUC statistics for two tests. Obuchowski’s approach
requires definition and calculation of appropriate sums of squares, and this can become
complicated when there are multiple sources of correlation, for example, when patients are evaluated
at multiple sites by more than one test with each test independently evaluated by more than one
reader.

Pepe recently proposed a general regression methodology that allows multi-site assessments
(Pepe, 1998). This regression approach estimates the effects of covariates on the receiver
operating characteristic (ROC) curve. The interpretation of the regression model depends on the
functional form chosen for the ROC curve. Coefficients estimated from a logistic model are
interpretable as the log-odds of correctly classifying a diseased subject for a fixed false positive rate.
Pepe suggests using bootstrap resampling to estimate standard errors of regression coefficients
when correlated data are included in these models.

This  article  demonstrates  a  very  simple  bootstrap  approach  for  estimating  true  positive  rates, 
false  positive  rates,  and  the  area  under  the  ROC  curve  for  multi-site  test  outcome  data.  This 
bootstrap  approach  is  useful  for  simple  comparisons  between  tests,  when  there  are  no  covariates. 
When using regression approaches, bootstrap estimates can provide supplemental descriptive
statistics. The bootstrap approach is easy to use when there are multiple sources of correlation,
and the resulting confidence intervals are asymptotically consistent.

2.  Nonparametric  Accuracy  Statistics:  sensitivity,  specificity,  AUC 

The  accuracy  of  imaging  tests  is  based  on  radiologists’  interpretations  of  disease  state.  These 
interpretations are typically measured using a 5-point ordinal scale that ranges from ‘definitely
not diseased’ to ‘definitely diseased’. True positive rates, false positive rates, and the area under
the  receiver  operating  characteristic  curve  are  the  basic  statistics  used  to  measure  test  accuracy. 
These  statistics  condition  on  true  disease  state,  treating  disease  state  as  fixed  and  known  and 
treating  test  outcomes  (ratings)  as  randomly  distributed  outcomes.  When  disease  state  is  known 
without  error,  these  accuracy  statistics  are  independent  of  disease  prevalence. 

When test outcomes are dichotomous, sensitivity and specificity measure test accuracy.
Sensitivity is the probability of a positive test outcome (indicating presence of disease) when the
target disease is present. Specificity is the probability of a negative test outcome when the
disease is absent. When test outcomes are ordinal, sensitivity and specificity can be calculated by
dichotomizing outcomes. However, a single sensitivity-specificity pair cannot completely describe
the accuracy of an ordinal test because both rates depend on test stringency. Receiver operating
characteristic (ROC) curve analysis accounts for the tradeoff in these rates as test stringency
varies. Suppose the ordinal outcome of a diagnostic test, t, takes values in {1, 2, ..., K} with
increasing values of t corresponding to stronger evidence of disease. There are K + 1
possible ways to dichotomize the ordinal test, including ‘all positive’ and ‘none positive’, and each
is associated with a sensitivity-specificity pair. The empirical ROC curve is drawn by plotting
pairs of observed rates, (1 - specificity) versus sensitivity, and connecting the K + 1 consecutive
points with straight lines. The empirical ROC curve provides a simple graphical description of
test performance.
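
The construction of the K + 1 empirical ROC points can be illustrated with a short Python sketch. The function name and the toy ratings are invented for the example; a test is called positive at cutpoint k when the rating exceeds k, so k = K gives ‘none positive’ and k = 0 gives ‘all positive’.

```python
def empirical_roc(ratings, diseased, K):
    """Return the K + 1 empirical ROC points as (1 - specificity, sensitivity).

    ratings  : ordinal test outcomes in 1..K, one per observation
    diseased : 1 if the observation is truly diseased, 0 otherwise
    """
    n_d = sum(diseased)
    n_nd = len(diseased) - n_d
    points = []
    # Move from the strictest cutpoint (none positive) to the laxest (all positive).
    for k in range(K, -1, -1):
        tp = sum(1 for t, d in zip(ratings, diseased) if d and t > k)
        fp = sum(1 for t, d in zip(ratings, diseased) if not d and t > k)
        points.append((fp / n_nd, tp / n_d))
    return points


# Toy data: six observations rated on a 5-point scale.
toy_ratings = [1, 2, 3, 4, 5, 2]
toy_diseased = [0, 0, 1, 1, 1, 1]
roc_points = empirical_roc(toy_ratings, toy_diseased, K=5)
```

Joining consecutive points in `roc_points` with straight lines gives the empirical ROC curve; the first point is always (0, 0) and the last is always (1, 1).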

The overall accuracy of an ordinal test can be summarized by the area under the ROC curve
(AUC). The AUC estimates the probability of correctly ranking a randomly selected (diseased,
not-diseased) pair on the ordinal test scale. It ranges from 0 to 1, with the value 1 corresponding to a
perfect diagnostic test. A test that is no better than chance has an AUC equal to one half. The
AUC statistic is unbiased and asymptotically normally distributed. The test of H0: AUC = 1/2
based on the asymptotic distribution is equivalent to a Mann-Whitney test (Hanley and McNeil,
1982). The AUC test is essentially a test for differences in the distribution of test outcomes in
diseased and not-diseased groups.
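
The pair-ranking interpretation of the AUC can be made concrete with a minimal sketch that scores every (diseased, not-diseased) pair, counting ties as one half. The function name and toy data are assumptions made for illustration.

```python
def empirical_auc(ratings, diseased):
    """Empirical AUC: proportion of (diseased, not-diseased) pairs ranked correctly.

    A pair scores 1 when the diseased observation has the higher rating,
    1/2 when the two ratings are tied, and 0 otherwise.
    """
    pos = [t for t, d in zip(ratings, diseased) if d]
    neg = [t for t, d in zip(ratings, diseased) if not d]
    total = 0.0
    for t_pos in pos:
        for t_neg in neg:
            if t_pos > t_neg:
                total += 1.0
            elif t_pos == t_neg:
                total += 0.5
    return total / (len(pos) * len(neg))


# Toy data: four diseased and two not-diseased observations.
auc_value = empirical_auc([1, 2, 3, 4, 5, 2], [0, 0, 1, 1, 1, 1])
print(auc_value)  # 7.5 correctly ranked pairs out of 8 -> 0.9375
```

This pairwise count equals the area obtained by applying the trapezoidal rule to the empirical ROC curve for the same data.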

3.  Bootstrap  estimation  of  sensitivity,  specificity,  AUC 

Sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) are all generalized
U-statistics of order 1: each of these statistics is a sum of functions of statistically independent
quantities (Lee, 1990). Because sensitivity, specificity and the AUC are U-statistics, bootstrap
resampling provides consistent point and interval estimates (Bickel and Freedman, 1981; Arcones
and Gine, 1992).

Let $t_i = (t_{i1}, t_{i2}, \ldots, t_{im})'$ be the vector of ordinal test outcomes across $m$ sites for the
$i$th subject and let $d_i = (d_{i1}, d_{i2}, \ldots, d_{im})'$ be the corresponding vector of true states, where $d_{ij} = 1$
if the $j$th site of the $i$th patient is diseased and $d_{ij} = 0$ otherwise. Written in U-statistic form,
sensitivity and specificity for the $k$th cutpoint are:

$$\text{sensitivity}_k = \frac{1}{n_D} \sum_i \phi_k(t_i, d_i) \quad \text{and} \quad \text{specificity}_k = 1 - \frac{1}{n_{\bar{D}}} \sum_i \phi_k(t_i, 1 - d_i)$$

with kernel function $\phi_k(t_i, d_i) = \sum_j I_k(t_{ij}) d_{ij}$, where $I_k(t) = 1$ if $t > k$ and $I_k(t) = 0$ otherwise.
The associated sample sizes are $n_D = \sum_i \sum_j d_{ij}$ and $n_{\bar{D}} = \sum_i \sum_j (1 - d_{ij})$. Here $D$ indicates
presence of disease and $\bar{D}$ indicates absence of disease.
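
As a hedged illustration of these definitions, the sketch below computes sensitivity and specificity at a cutpoint from multi-site data by summing a per-patient kernel, with specificity taken as one minus the false positive rate among not-diseased sites. The data layout (one tuple of rating and state lists per patient) and the function names are assumptions.

```python
def phi_k(ratings, states, k):
    """Per-patient kernel: number of sites with state 1 and rating above k."""
    return sum(1 for t, d in zip(ratings, states) if d == 1 and t > k)


def sens_spec(patients, k):
    """Sensitivity and specificity at cutpoint k, pooled over all sites.

    patients : list of (ratings, states) tuples, one per patient, where
               states[j] is 1 if site j is diseased and 0 otherwise.
    """
    n_d = sum(sum(states) for _, states in patients)
    n_nd = sum(len(states) - sum(states) for _, states in patients)
    true_pos = sum(phi_k(r, s, k) for r, s in patients)
    # Applying the kernel to 1 - d counts positive calls at not-diseased sites.
    false_pos = sum(phi_k(r, [1 - d for d in s], k) for r, s in patients)
    return true_pos / n_d, 1.0 - false_pos / n_nd


# Toy data: three patients with three sites each.
toy_patients = [
    ([5, 2, 1], [1, 0, 0]),
    ([4, 3, 2], [1, 1, 0]),
    ([1, 2, 1], [0, 0, 0]),
]
```

With these toy data, calling any rating above 1 positive detects all three diseased sites but also flags three of the six not-diseased sites.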

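The AUC statistic is given by:

$$\text{AUC} = \frac{1}{n_D n_{\bar{D}}} \sum_{i,j} \sum_{i',j'} d_{ij} (1 - d_{i'j'}) \, \psi(t_{ij}, t_{i'j'})$$

with kernel function

$$\psi(t_{ij}, t_{i'j'}) = \begin{cases} 1 & \text{if } t_{ij} > t_{i'j'} \\ \tfrac{1}{2} & \text{if } t_{ij} = t_{i'j'} \\ 0 & \text{if } t_{ij} < t_{i'j'} \end{cases}$$
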

When both diseased and not-diseased sites can occur within a patient, the sum corresponding to
the AUC statistic includes functions of correlated pairs of diseased and not-diseased observations,
violating U-statistic properties. However, relatively few correlated $(D, \bar{D})$ pairs are included in
the sum. Let $p_p$ be the patient-prevalence of disease, and let $p_s$ be the expected proportion of
sites with disease given a patient has disease. If all patients with disease have the same number
of affected sites, then the proportion of correlated $(D, \bar{D})$ pairs is

$$\frac{(1 - p_s)}{(1 - p_p p_s)} \cdot \frac{1}{N}.$$

When all patients have a single disease state, $p_s = 1$, there are no correlated $(D, \bar{D})$ pairs. The
proportion of correlated pairs is maximized at $1/N$ when all $N$ patients have disease ($p_p = 1$). When
there are correlated pairs, the U-statistic properties of the AUC statistic can be maintained by
excluding correlated pairs from AUC sums. In most cases this exclusion is unnecessary because
the proportion of correlated comparisons quickly becomes negligible as sample size increases. This
article examines bootstrap resampling that is applied directly to AUC statistics, without
excluding correlated pairs.
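
A small numerical check shows how quickly the proportion of correlated pairs shrinks. The sketch evaluates the closed-form expression and compares it with a direct count of pairs; it assumes that every patient has the same number of sites and every diseased patient the same number of diseased sites, and all function names are hypothetical.

```python
def correlated_pair_proportion(p_p, p_s, n_patients):
    """Closed form: (1 - p_s) / ((1 - p_p * p_s) * N)."""
    return (1.0 - p_s) / ((1.0 - p_p * p_s) * n_patients)


def counted_proportion(n_patients, n_diseased_patients, m, m_diseased):
    """Direct count of within-patient (diseased, not-diseased) site pairs
    divided by all (diseased, not-diseased) site pairs."""
    n_d = n_diseased_patients * m_diseased
    n_nd = n_patients * m - n_d
    within_patient = n_diseased_patients * m_diseased * (m - m_diseased)
    return within_patient / (n_d * n_nd)


# 100 patients, half with disease, 10 sites each, 5 diseased sites per diseased patient.
closed_form = correlated_pair_proportion(p_p=0.5, p_s=0.5, n_patients=100)
by_counting = counted_proportion(100, 50, m=10, m_diseased=5)
```

In this example both calculations give 1/150, so fewer than 1% of the pairs entering the AUC sum are correlated.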

Bootstrap samples are constructed by stratifying patients on overall disease state (any or
none) and drawing patients, the independent units, with replacement from these strata.
Resampling patient-level data incorporates all sources of within-patient variability. Stratifying the
bootstrap samples by patient-level disease state corresponds to conditioning on true disease state,
a property of the accuracy statistics sensitivity, specificity and AUC, and reflects the sampling
strategy often used when evaluating diagnostic test performance. Accuracy statistics are
calculated for each bootstrap sample. Point estimates are simple averages of these statistics. The accuracy
of two tests can be compared by calculating the difference in accuracy statistics for each
bootstrap sample, incorporating between-test correlation. Standard errors are estimated by the observed
standard deviations of the statistics across bootstrap samples. Standard errors should be based on at least 100 draws.
Confidence intervals can be estimated using bootstrap-estimated standard errors, with a normal
approximation. Confidence intervals can also be estimated using percentiles, though this requires
at least 1,000 bootstrap draws.
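
The resampling scheme described above can be sketched in Python as follows. This is not the SAS macro used for the analyses; it is a minimal illustration using only the standard library, and the data layout, function names, seed, and percentile indexing are assumptions.

```python
import random
import statistics


def pooled_auc(sites):
    """Empirical AUC over pooled (rating, diseased) site pairs; ties count 1/2."""
    pos = [t for t, d in sites if d]
    neg = [t for t, d in sites if not d]
    if not pos or not neg:
        raise ValueError("need both diseased and not-diseased sites")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(patients, n_boot=1000, seed=0):
    """Patient-level bootstrap, stratified on any versus no disease.

    patients : list of per-patient lists of (rating, diseased) site pairs.
    Returns (mean, standard error, normal 95% interval, percentile 95% interval).
    """
    rng = random.Random(seed)
    any_disease = [p for p in patients if any(d for _, d in p)]
    no_disease = [p for p in patients if not any(d for _, d in p)]
    stats = []
    for _ in range(n_boot):
        # Draw whole patients with replacement within each stratum.
        sample = rng.choices(any_disease, k=len(any_disease))
        if no_disease:
            sample = sample + rng.choices(no_disease, k=len(no_disease))
        stats.append(pooled_auc([site for p in sample for site in p]))
    stats.sort()
    mean = statistics.mean(stats)
    se = statistics.stdev(stats)
    normal_ci = (mean - 1.96 * se, mean + 1.96 * se)
    percentile_ci = (stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1])
    return mean, se, normal_ci, percentile_ci


# Toy data: six patients with three sites each, two of them free of disease.
toy = [
    [(5, 1), (2, 0), (1, 0)],
    [(4, 1), (3, 1), (2, 0)],
    [(1, 0), (2, 0), (1, 0)],
    [(3, 1), (1, 0), (4, 0)],
    [(2, 0), (1, 0), (3, 0)],
    [(5, 1), (4, 1), (1, 0)],
]
```

To compare two tests, the same resampled patients would be used to compute both statistics and their difference within each draw, so that between-test correlation carries through to the standard error.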

4.  Angiography  Study 

Contrast angiography (CA) is the usual method for mapping vascular occlusion prior to bypass
surgery in patients with peripheral vascular disease. Magnetic resonance angiography (MRA)
is an alternative method for obtaining the same diagnostic information. MRA is less invasive
than CA because it does not require injection of contrast materials. The ability of CA and
MRA to correctly identify open vessel segments was compared using a prospective study, with
intraoperative angiography used as the gold standard (Baum et al, 1995). Patients were evaluated
at 15 sites (vessel segments). Analyses were based on 96 patients with peripheral vascular
disease who had intraoperative angiography results and at least one preoperative angiographic
test (eleven patients were missing an MR assessment). On average, 33% of each patient’s
vessel segments were occluded. Overall, 335 of 932 segments with gold standard information
were occluded (36%). Study radiologists rated the occlusion of each vessel segment using a
five-point scale: 1) normal; 2) minimal disease (<50% stenosis); 3) stenotic (a single lesion with
>50% stenosis but not fully occluded); 4) diffuse disease (multiple lesions with >50% stenosis
but not fully occluded); 5) fully occluded. Ratings within patients were moderately correlated,
with similar degrees of correlation for the two tests. Overall, correlation (based on Kendall’s τ)
was 0.20 for CA and 0.19 for MRA. Correlation between sites with the same disease state was
0.49 for CA and 0.46 for MRA. Correlation between sites with different disease states was -0.34
for CA and -0.36 for MRA. Correlation between CA and MRA ratings of the same site was 0.62.

Original analyses examined both detection of near-normal (patent) vessel segments (ratings
1 and 2) and detection of open segments (ratings 1 through 4). Both methods were similar in
their ability to detect open vessel segments: both had 81% specificity, CA had an 83% sensitivity
and MRA had an 85% sensitivity. In detecting patent segments, CA was less sensitive than
MRA (77% versus 82%) but more specific (92% versus 84%). Based on these descriptive data
and statistical tests for differences in odds ratios, the original investigators concluded that CA
and MRA had similar diagnostic ability. Bootstrap estimation allows us to estimate AUC statistics
for patent segments, to examine whether differences are likely due to a threshold effect, and to
place confidence intervals on estimated sensitivity and specificity. We report bootstrap percentile
intervals based on 1000 bootstrap samples. Bootstrap estimates of sensitivity were CA: 76% with
95% CI (70.5, 81.8) and MRA: 82% with 95% CI (76.8, 87.0). Bootstrap estimates of specificity
were CA: 93% with 95% CI (89.8, 95.9) and MRA: 84% with 95% CI (79.4, 88.1).
CA and MRA had similar empirical AUC statistics. The empirical AUC for CA was 0.879 with 95%
confidence interval (0.847, 0.910). The empirical AUC for MRA was 0.874 with 95% confidence
interval (0.844, 0.904). The bootstrap estimate of the difference in AUC statistics was 0.005 with
95% confidence interval (-0.035, 0.044).

5.  Simulation  Study 

This simulation study describes characteristics of bootstrap accuracy estimates and compares them to
Obuchowski's analytic estimates (Obuchowski, 1997). Bootstrap confidence intervals based on
normal  approximations  were  based  on  100  bootstrap  samples.  Bootstrap  confidence  intervals 
based  on  percentiles  were  based  on  1000  bootstrap  samples.  Comparisons  focus  on  the  observed 
coverage  of  95%  confidence  intervals  for  differences  between  two  AUC  statistics.  The  description 
of  bootstrap  estimates  also  includes  coverage  rates  for  estimated  false  positive  rates. 

Simulated data represent comparisons between two tests (A and B) with outcomes on a 5-point
ordinal scale. Test A has an empirical AUC equal to 0.8 and specificities equal to (0.5,
0.7, 0.9, 0.95). Test B has the same specificities and an AUC statistic equal to 0.80 or 0.85. The
bootstrap's ability to handle multiple sources of variability was evaluated by simulating outcomes
for two readers per test. The overall diagnostic accuracy of each test was based on the average of
the two readers' AUC statistics. Data simulated for two readers assume that readers evaluating
the same test had equal accuracy, with the same specificities and the same AUC statistics.
Two-reader bootstrap AUC estimates were calculated by estimating each reader's AUC statistic and then
averaging these within each bootstrap sample.

Ordinal test outcomes were simulated by categorizing continuous multivariate normal (MVN)
pseudodeviates. One MVN pseudodeviate of length $4m$ was generated for each patient,
where $m$ is the number of sites within patients. Each independent MVN pseudodeviate
represents a single patient's unobservable continuous test outcome for 2 tests and 2 readers.
Within-patient correlation was induced on the continuous scale. Simulations examine the characteristics
of estimators for three within-subject correlation structures: independent, compound symmetry,
and disease-dependent. Under the compound symmetry structure, multiple observations within
subjects are equicorrelated, with correlation equal to 0.50. The disease-dependent structure is
identical to the compound symmetry structure with one exception: under the disease-dependent
structure, observations from sites with different disease states (i.e., $(D, \bar{D})$ pairs) are negatively
correlated, with correlation equal to -0.50.
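
To make the three structures concrete, the sketch below builds the site-by-site correlation matrix for a single test and reader. It is an illustration only: the function name and the string labels are assumptions for the example, and it covers only the $m \times m$ block for one test and reader. The between-test and between-reader correlations of the full $4m$-vector are not specified in this passage and are left out.

```python
def site_correlation(disease, structure, rho=0.5):
    """Correlation matrix between one patient's sites on the latent scale.

    disease   -- sequence of 0/1 disease indicators, one per site
    structure -- "independent", "compound", or "disease-dependent"
    """
    if structure not in ("independent", "compound", "disease-dependent"):
        raise ValueError("unknown correlation structure: " + structure)
    m = len(disease)
    corr = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    if structure == "independent":
        return corr
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            same_state = disease[i] == disease[j]
            if structure == "compound" or same_state:
                corr[i][j] = rho       # equicorrelated pairs
            else:
                corr[i][j] = -rho      # (D, not-D) pairs, disease-dependent only
    return corr
```

For example, `site_correlation([1, 1, 0, 0], "disease-dependent")` gives 0.5 between the two diseased sites and -0.5 between a diseased and a disease-free site.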

Simulations examine 3 sampling scenarios. Under the first scenario (small N), 100 patients,
50 with disease and 50 without, are evaluated at 4 sites. Under the second scenario (large m),
100 patients, all with disease, are evaluated at 15 sites. Under the third scenario (large N),
500 patients, 250 with disease and 250 without, are evaluated at 4 sites. For all scenarios,
patients with disease are expected to have disease at half of the sites. The number of
disease-positive sites for each patient was simulated using a binomial random number generator. Ordinal
ratings were derived from MVN deviates by assuming an underlying bivariate normal ROC model
(Hanley, 1989). That is, 'cut points' for each of the five rating categories are set equal to
$\theta_0 = -\infty$, $\theta_k = \Phi^{-1}(\mathrm{specificity}_k)$, $k = 1, \ldots, 4$, and $\theta_5 = +\infty$. Given $\mu$ and $\theta$, sensitivities
were $\mathrm{sensitivity}_k = \Phi(\mu - \theta_k)$. Desired empirical AUC statistics were obtained with $\mu = 1.29$
for AUC=0.80 and $\mu = 1.949$ for AUC=0.85. For disease negative sites, the ordinal rating
corresponding to the MVN deviate $y$ is equal to $k$ when $\theta_{k-1} < y \le \theta_k$. For disease positive
sites, the MVN deviates are first shifted by an appropriate $\mu$, with simulated ratings based on
categorizing $y + \mu$.
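
The cut-point construction can be checked numerically. The sketch below assumes a standard normal latent scale for disease-negative sites and uses the trapezoidal area under the five-category ROC points as the empirical AUC. The function names are illustrative and not taken from the original simulation code.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # latent distribution for disease-negative sites


def cutpoints(specificities):
    """theta_k = Phi^{-1}(specificity_k) for k = 1..4."""
    return [STD_NORMAL.inv_cdf(s) for s in specificities]


def sensitivities(mu, thetas):
    """sensitivity_k = Phi(mu - theta_k) when diseased deviates are y + mu."""
    return [STD_NORMAL.cdf(mu - t) for t in thetas]


def rating(y, thetas):
    """Ordinal category k (1..5) such that theta_{k-1} < y <= theta_k."""
    return 1 + sum(y > t for t in thetas)


def trapezoidal_auc(specificities, sens):
    """Area under the ROC points, closed with (0, 0) and (1, 1)."""
    fpr = [0.0] + sorted(1 - s for s in specificities) + [1.0]
    tpr = [0.0] + sorted(sens) + [1.0]
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in range(len(fpr) - 1))


specs = [0.5, 0.7, 0.9, 0.95]
thetas = cutpoints(specs)
auc_a = trapezoidal_auc(specs, sensitivities(1.29, thetas))
```

With the specificities used in the simulations and $\mu = 1.29$, `auc_a` comes out close to 0.80, in line with the value quoted for Test A.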

Simulation results were based on 5,000 simulated data sets for each combination of $\mathrm{AUC}_B$
(0.80  or  0.85),  sampling  scenario  (small  N,  large  m,  or  large  N),  and  correlation  structure 
(independent,  equicorrelated,  and  disease-dependent). 
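
When reading the coverage tables it helps to keep the Monte Carlo error of an estimated coverage rate in mind. The sketch below is a generic illustration under a simple binomial assumption, not part of the original analysis, and the function names are hypothetical. It shows how coverage is tallied and how precise a rate based on 5,000 data sets can be.

```python
import math


def coverage_rate(intervals, truth):
    """Fraction of (lower, upper) intervals that contain the true value."""
    hits = sum(lower <= truth <= upper for lower, upper in intervals)
    return hits / len(intervals)


def monte_carlo_se(coverage, n_sim):
    """Binomial standard error of a coverage rate estimated from n_sim data sets."""
    return math.sqrt(coverage * (1 - coverage) / n_sim)


se_5000 = monte_carlo_se(0.95, 5000)
```

Under this assumption, a true coverage rate of 0.95 estimated from 5,000 data sets has a standard error of roughly 0.003, so differences of a few thousandths between estimators are within simulation noise.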

6.  Simulation  Results: 

Table  1  shows  the  observed  within-patient  correlation  in  the  simulated  categorical  data.  These 
rating  data  are  inherently  correlated,  since  diseased  sites  were  more  likely  to  have  high  scores 
than  not  diseased  sites. 

Table 2 shows coverage rates of 95% confidence intervals for the difference between $\mathrm{AUC}_A$
and $\mathrm{AUC}_B$ based on Obuchowski's analytic estimator, the single-reader bootstrap percentile
interval and the two-reader bootstrap percentile interval. Coverage rates for normal-approximation
bootstrap intervals are not shown because they were similar to percentile intervals, with slightly
poorer coverage properties. In general, coverage rates of the normal-approximation interval fell between
coverage rates for the analytic and bootstrap percentile intervals. Both the sampling scenario
and the correlation structure affected the observed coverage rates, but true differences between
the two AUC statistics did not affect coverage. Within the small N and large m sampling schemes,
intervals estimated from data with compound symmetry correlation tended to have coverage rates
that were further from the 95% level than estimates based on independent data. The bootstrap
and analytic estimates had very similar coverage rates for the small N and large N scenarios. The analytic
estimator had better coverage for the large m scenario. Single-reader and two-reader bootstrap
estimates had similar coverage rates.

The analytic estimator and the single-reader bootstrap estimator had similar mean squared
errors (MSEs). Across simulated data sets, the MSE of the bootstrap estimate for one reader was
less than 0.1% higher than the MSE of the analytic estimate. The MSE of the two-reader bootstrap
estimate was approximately half the MSE of either single-reader estimate.

Table 4 shows coverage rates of bootstrap percentile interval estimates for specificities.
Coverage rates were generally less than the nominal level, but improved as the specificity decreased
from 0.95 to 0.50 and as the amount of data available for estimation increased. Percentile intervals
had better coverage rates than asymptotic normal intervals (not shown). When specificity was
0.95, a few (less than 1%) of the asymptotic normal bootstrap intervals fell outside of the (0, 1)
range.

7.  Discussion 

Diagnostic evaluation often involves testing patients at multiple sites. Bootstrap and analytic
estimation methods allow simple comparisons of AUC statistics based on clustered patient data.
These methods are asymptotically consistent. However, evaluations of diagnostic tests often
involve relatively small sample sizes. We used a simulation study to evaluate the small sample
characteristics of Obuchowski's analytic AUC estimator and bootstrap AUC estimators applied to
ordinal test data. When comparing two tests with one reader per test, the bootstrap and analytic
estimators had very similar performance. Both methods produced confidence intervals with
observed coverage rates below the nominal level. Coverage rates of bootstrap percentile confidence
intervals were nearly identical to those of asymptotic normal intervals for AUC statistics. However,
percentile intervals had better coverage than asymptotic normal intervals for proportions. Although
these methods are asymptotically consistent, simulations suggest that when test outcomes are
ordinal and tests are relatively accurate, large samples are needed before asymptotic results hold.

The simulations presented in this article demonstrated poorer performance for Obuchowski's
estimator than was originally reported. There are important differences between the simulations in this
article and those presented by Obuchowski. Two key differences are site-level prevalence of disease
and the scale of the test outcome. In the smallest sample setting, patient-level disease prevalence
was 50% and among patients with disease an average of 50% of sites were affected, resulting
in a 25% overall site-level prevalence. Obuchowski simulated data with an overall 50% site-level
prevalence. Obuchowski also generated outcomes on a continuous 0 to 100 scale, rather than
the 5-point ordinal scale more commonly found in radiology. The continuous scale allows more
variability in true positive (tp) and false positive (fp) rates. A comparison between continuous
scales is also more informative than a comparison between corresponding ordinal scales because
there are no ties.

Simulation studies examine the behavior of estimators in specific settings. The simulation
study examined plausible scenarios. In radiology research, test outcomes are often measured
on a 5-point ordinal scale, and these tests can be highly accurate with relatively high specificity.
Overall sample sizes are often small, including 100 or fewer subjects. There were some important
assumptions that may limit the conclusions that can be drawn from simulation study findings.
One important assumption made for these simulations was that the two tests compared had the
same underlying specificities. Perhaps the strongest assumption made for simulated data was
that when two readers were involved they each had identical ROC curves. In real-life settings,
readers' ROC curves will almost certainly differ. In this context, the investigator must determine
whether there is value in the estimate of the average AUC statistic.

Table 1. Average observed correlation of simulated rating data (Kendall's $\tau$)
between tests when both have AUC = 0.80 and between readers evaluating
Test A.

                                                    correlation structure
type of correlation                        independence   compound    disease
                                                          symmetry    dependent
between sites, same disease state             0.199        0.478       0.478
between sites, different disease state        0.164        0.457       0.457
between tests, same site                      0.000        0.348      -0.240
between readers, same site                    0.200        0.478       0.478

Table 2. Observed coverage rates of 95% confidence intervals. In all
cases $\mathrm{AUC}_A = 0.8$. Small N simulations generate data from 50 diseased patients
and 50 not diseased patients, each evaluated at 4 sites by both tests. Large m
simulations generate data from 100 diseased patients, each evaluated at 15 sites
by both tests. Large N simulations generate data from 250 diseased patients
and 250 not diseased patients, each evaluated at 4 sites by both tests.

correlation                                      sampling design
structure            estimator             small N   large m   large N
independent          analytic               0.934     0.932     0.946
                     1 reader bootstrap     0.939     0.944     0.946
                     2 reader bootstrap     0.935     0.947     0.949
equicorrelated       analytic               0.936     0.934     0.952
                     1 reader bootstrap     0.939     0.946     0.950
                     2 reader bootstrap     0.939     0.947     0.950
disease dependent    analytic               0.939     0.936     0.950
                     1 reader bootstrap     0.940     0.943     0.948
                     2 reader bootstrap     0.938     0.944     0.950

Table 3. Observed coverage rates of 95% confidence intervals. In all
cases $\mathrm{AUC}_A = 0.8$. Small N simulations generate data from 50 diseased
patients and 50 not diseased patients, each evaluated at 4 sites by both
tests. Large m simulations generate data from 100 diseased patients, each
evaluated at 15 sites by both tests. Large N simulations generate data from
250 diseased patients and 250 not diseased patients, each evaluated at 4 sites
by both tests.

correlation structure                        small N   large m   large N

fp=0.95
  independent observations
    asymptotic normal                         0.929     0.936     0.941
    percentile                                0.942     0.941
  equicorrelated, correlation=0.5
    asymptotic normal                         0.912     0.923     0.936
    percentile                                0.926     0.926
  correlation dependent on disease state
    asymptotic normal                         0.911     0.912     0.940
    percentile                                0.926     0.928

fp=0.50
  independent observations
    asymptotic normal                         0.942     0.942     0.946
    percentile                                0.944     0.949
  equicorrelated, correlation=0.5
    asymptotic normal                         0.941     0.942     0.948
    percentile                                0.948     0.947
  correlation dependent on disease state
    asymptotic normal                         0.941     0.941     0.950
    percentile                                0.948     0.941

A hierarchical regression approach to
meta-analysis of diagnostic test accuracy evaluations


Summary 

An important quality of meta-analytic models for research synthesis is their ability to account
for both within- and between-study variability. Currently available meta-analytic approaches for
studies  of  diagnostic  test  accuracy  work  primarily  within  a  fixed-effects  framework.  In  this  paper 
we  describe  a  hierarchical  regression  model  for  meta-analysis  of  studies  reporting  estimates  of 
test  sensitivity  and  specificity.  The  model  allows  more  between-  and  within-study  variability  than 
fixed-effect  approaches,  by  allowing  both  test  stringency  and  test  accuracy  to  vary  across  studies. 
It  is  also  possible  to  examine  the  effects  of  study  specific  covariates.  Estimates  are  computed  using 
Markov  Chain  Monte  Carlo  simulation,  allowing  flexibility  in  the  choice  of  summary  statistics.  We 
demonstrate  our  modelling  approach  using  a  recently  published  meta-analysis  comparing  three 
tests  used  to  detect  nodal  metastasis  of  cervical  cancer. 

keywords:  summary  ROC  curve,  Bayesian  methods,  sensitivity,  specificity. 
1. Introduction


The  need  for  systematic  review  and  synthesis  of  published  evidence  on  the  accuracy  of  diagnostic 
tests  has  increased  in  recent  years.  The  information  from  such  reviews  is  a  key  element  of  clinical 
and  health  policy  decision  making  regarding  the  use  of  diagnostic  tests;  it  is  also  essential  for 
guiding the process of technology development and evaluation in diagnostic medicine [1, 2].

Statistical  methods  for  meta-analysis  of  diagnostic  test  evaluations  have  focused  on  the 
analysis  of  studies  reporting  estimates  of  test  sensitivity  and  specificity,  the  most  commonly 
used  measures  of  diagnostic  performance,  and  have  worked  within  the  fixed-effects  framework 
[1,  2,  3,  4,  5,  6,  7,  8,  9].  A  fundamental  concern  in  the  synthesis  of  such  studies  is  derivation  of 
summary  measures  of  test  performance.  These  measures  must  account  for  the  tradeoff  between 
sensitivity  and  specificity  as  the  threshold  for  positivity  varies  along  some  explicit  or  latent  scale. 
This  tradeoff  has  been  widely  recognized  in  the  evaluation  of  diagnostic  tests  and  has  led  to  the 
development  of  Receiver  Operating  Characteristic  (ROC)  methodology  [10].  In  the  context  of 
meta-analysis,  simple  averaging  or  pooling  across  studies  can  provide  misleading  conclusions,  as 
can  be  readily  seen  from  a  simple  example.  If  three  studies  report  the  following  estimates  of  test 
sensitivity  and  specificity:  (.10,  .90),  (.80,  .80),  and  (.90,  .10),  the  average  pair  of  sensitivity  and 
specificity  is  (.60,  .60)  and  lies  completely  outside  the  domain  of  the  original  studies  (see  also 
[1, 2, 6]).

Differences  in  positivity  threshold  constitute  an  important  source  of  variation  across  studies 
evaluating  a  diagnostic  modality.  Study  characteristics,  such  as  technical  aspects  of  the  diagnostic 
test,  patient  and  disease  cohorts,  study  settings,  experience  of  readers,  and  sample  size  are  also 
potential contributors to between-study variation in the estimates of diagnostic performance.
In the fixed effects setting, regression models have been proposed for exploring these sources of
variability  [4,  5].  The  use  of  regression  models  provides  a  flexible  and  powerful  framework  for 
meta-analysis.  However,  the  number  of  covariates  that  can  be  accommodated  in  such  models  is 
limited.  In  addition,  these  fixed-effects  regression  models  may  not  provide  realistic  accounts  of 
the  uncertainty  associated  with  covariate  estimates. 

In  this  paper,  we  expand  on  earlier  work  [5]  and  present  a  hierarchical  model  formulation  of 
the  problem  of  combining  information  across  studies  reporting  estimates  of  test  sensitivity  and 
specificity.  The  structure  of  the  model  is  similar  to  that  of  models  proposed  for  the  meta-analysis 
of treatment studies [11, 12, 13]. The observed variation is partitioned into within- and between-study
components. Each component consists of a systematic part and a random part, with the
former  attributed  to  covariates  and  the  latter  to  unexplained  variation.  The  hierarchical  model 
makes  it  possible  to  pool  information  across  studies  and  derive  smoothed  estimates  of  covariate 
effects,  components  of  variance  and  individual  study  quantities.  In  addition,  simple  extensions  of 
the  hierarchical  structure  can  incorporate  patient-level  information  within  each  study,  when  such 
information  is  available. 

We  present  our  approach  using  data  from  a  recently  published  meta-analysis  comparing  the 
diagnostic  performance  of  three  imaging  modalities  for  the  detection  of  lymph  node  metastases  in 
women  with  cervical  cancer  [14].  In  section  2  we  survey  fixed  effects  approaches  to  the  problem. 
The  hierarchical  regression  model  is  presented  in  section  3.  We  take  a  fully  Bayesian  approach  to 
model  fitting  and  checking  and  use  Markov  Chain  Monte  Carlo  estimation  techniques.  Technical 
issues  are  discussed  in  section  3  and  the  analysis  of  the  example  is  presented  in  section  4.  The 
final section summarizes our methodological and subject matter conclusions.

2. Meta-analytic models for diagnostic test data

The simplest setting for the methods discussed in this paper involves meta-analyses in which each
of $m$ studies contributes a vector $Z_i$ of study-level covariates ($i = 1, \ldots, m$) and a 2x2 table of
summary data, showing the agreement between the binary test result and the definitive disease
information ("gold standard"). We will use the following notation:

                          Test:
                    0 = no      1 = yes
    Truth:   no     $y_{i00}$     $y_{i01}$     $n_{i0}$
             yes    $y_{i10}$     $y_{i11}$     $n_{i1}$

The observed rates of true and false positive test results are then defined as $TP_i = y_{i11}/n_{i1}$
and $FP_i = y_{i01}/n_{i0}$. In some meta-analyses more than one 2x2 table is available from each
study. For example, patients may be examined using several tests in a study, leading to correlated
binary test results.

2.1  Summary  ROC  (SROC)  curve 

In the absence of patient and study level covariates, a simple graphical summary of test accuracy
is provided by the summary ROC curve (SROC) [4]. The curve is constructed by computing the
quantities

$D_i = \mathrm{logit}(TP_i) - \mathrm{logit}(FP_i)$ and $S_i = \mathrm{logit}(TP_i) + \mathrm{logit}(FP_i)$

for each study and fitting the linear model

$D_i = a + b S_i + e_i$    (1)

where $e_i$ is random error. The model can be fitted using ordinary or weighted least squares,
or robust regression methods. Weights can be used to account for between-study differences in
overall sample size or precision. However, weights cannot simultaneously capture differences in
sample size within the disease-positive ($n_{i1}$) and disease-negative ($n_{i0}$) groups. These two sample
sizes affect the precision of estimated TP and FP rates independently. In practice, weighted
and unweighted models can produce very different results, and there is no clear way to choose
between these models.

Using the estimates of $a$ and $b$, a plot of the summary ROC curve can be drawn, with FP
on the x-axis and TP on the y-axis. This SROC model corresponds to the assumption that
the observed differences across studies result from different thresholds for test positivity. The
summary curve is symmetric if $b = 0$, implying a constant log-odds ratio across the studies under review.
Study level covariates can be incorporated in a straightforward manner into equation (1) to provide
an exploratory analysis of the effects of study characteristics. Several summaries of the SROC
curve have been proposed and can be used to make comparisons between modalities. However,
the SROC model does not account for error in $S_i$ and this can bias parameter estimates [15] and
summaries that are functions of these estimates. Further exploration is needed to determine the
effect of ignoring error in $S_i$ on both point estimation and coverage rates of estimated confidence
intervals.
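
As a rough illustration of the SROC construction, the sketch below computes $D_i = \mathrm{logit}(TP_i) - \mathrm{logit}(FP_i)$ and $S_i = \mathrm{logit}(TP_i) + \mathrm{logit}(FP_i)$, fits the linear model by unweighted least squares, and back-transforms the fitted line to the ROC plane. Solving $D = a + bS$ for $\mathrm{logit}(TP)$ gives $\mathrm{logit}(TP) = \{a + (1 + b)\,\mathrm{logit}(FP)\}/(1 - b)$. The function names are illustrative, and the sketch assumes all observed rates lie strictly between 0 and 1; in practice studies with empty cells need some form of continuity correction before the logit is taken.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def fit_sroc(tp_rates, fp_rates):
    """Unweighted least-squares fit of D_i = a + b * S_i."""
    d = [logit(t) - logit(f) for t, f in zip(tp_rates, fp_rates)]
    s = [logit(t) + logit(f) for t, f in zip(tp_rates, fp_rates)]
    s_bar = sum(s) / len(s)
    d_bar = sum(d) / len(d)
    sxy = sum((si - s_bar) * (di - d_bar) for si, di in zip(s, d))
    sxx = sum((si - s_bar) ** 2 for si in s)
    b = sxy / sxx
    a = d_bar - b * s_bar
    return a, b


def sroc_tp(fp, a, b):
    """True positive rate on the fitted summary curve at false positive rate fp."""
    return expit((a + (1 + b) * logit(fp)) / (1 - b))
```

Evaluating `sroc_tp` over a grid of false positive rates within the observed range traces the summary curve; with `b = 0` the curve is symmetric about the anti-diagonal.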

An alternative approach to constructing an SROC curve was proposed by Kardaun and
Kardaun [3], who assumed that $(\mathrm{logit}(TP_i), \mathrm{logit}(FP_i))$ follows a bivariate normal distribution
and postulated a linear relationship between the two components of the bivariate mean. Profile
likelihood is used to derive estimates of the slope and intercept in this model, which includes
variability in both rates. Difficulties incorporating covariates into this model limit its usefulness.

2.2  Binomial  regression  model 

A regression model for the meta-analysis of $(TP_i, FP_i)$ pairs was first discussed in [5] and was
motivated by the ordinal regression formulation of ROC analysis [16, 17, 18]. In brief, if $W$ denotes
the degree of suspicion about the presence of an abnormality, elicited on an ordinal categorical
scale with $J$ categories, the parametric ROC model is equivalent to the ordinal regression model
$g(P[W > j \mid X]) = (\theta_j + \alpha X)\exp(-\beta X)$, where $X$ is a covariate denoting the (binary) true
disease status. The conceptual basis of the model is an assumption that the observed responses
$W$ represent a categorization of a latent variable, with distribution corresponding to the link
function, $g(\cdot)$. The probit link implies a Gaussian latent variable and is commonly used for
single-study receiver operating characteristic analysis [10]. We use the logit link throughout this article
because under the logit model, regression parameters estimate log-odds ratios.

As discussed in [5], an ordinal regression model with a logistic link and $J = 2$ can be used
in the meta-analysis of studies reporting pairs of $(TP_i, FP_i)$. Under this model, the numbers
of positive tests from each study, $y_{ij1}$, $i = 1, \ldots, m$, $j = 0, 1$, are assumed to follow binomial
distributions, $y_{ij1} \sim \mathrm{Binomial}(n_{ij}, \pi_{ij})$, in which the probability of a positive test is modelled as:

$\pi_{ij} = \mathrm{logit}^{-1}\left((\theta_i + \alpha X_{ij})\, e^{-\beta X_{ij}}\right)$.    (2)

As in the ROC context, the $\theta_i$'s will be called the "positivity criteria" (or "cutpoint parameters"),
$\alpha$ the "accuracy parameter" and $\beta$ the "scale parameter". The tradeoff between TP and FP is
modelled through their joint dependence on $\theta$. The binomial regression model is estimated by
maximum likelihood and accounts for error in both $TP_i$ and $FP_i$ rates.

It can be readily seen that the binomial regression model (2) implies a linear relationship
between logit(TP) and logit(FP). This linearity is a basic assumption in the two SROC models
discussed earlier and implies a natural correspondence between SROC and the binary regression
analysis. Like the SROC model, the simple binary regression model (2) assumes that observed
differences across studies result from different positivity thresholds ($\theta_i$). Tests are assumed to
have the same accuracy and scale parameter across studies. In addition, as discussed in [5], more
elaborate versions of (2) can be formulated, in which parameters other than $\theta$ vary across studies
and study level covariates are included. In practice, making choices among such elaborate models
is not a straightforward matter. An important consideration in making such choices is to ensure
that the resulting model is identifiable.

3.  Hierarchical  regression  analysis 
3.1  Model 

Hierarchical regression analysis extends the binomial regression model to more fully account for
both within- and between-study variability in TP and FP rates. The model allows the inclusion
of patient- and study-level covariates, if such information is available, and has the following
structure:

Level I (Within-study variation). The numbers of positive tests from the $i$-th study, $y_{i01}$ and
$y_{i11}$, are assumed to follow binomial distributions, with the probability of a positive test given by:

$\pi_{ij} = \mathrm{logit}^{-1}\left[(\theta_i + \alpha_i X_{ij})\, e^{-\beta X_{ij}}\right]$    (3)

where $X_{ij}$ denotes the true disease status for cases in the $ij$-th cell. Under this hierarchical SROC
model (HSROC), both positivity criteria ($\theta_i$) and accuracy parameters ($\alpha_i$) are allowed to vary
across studies.


Level II (Between-study variation). Study-level parameters in (3), $\alpha_i$ and $\theta_i$, are assumed to
be Normally distributed, with mean determined by a linear function of study-level covariates. In
the case of a single covariate $Z$ affecting both the cutpoint and accuracy parameters, the model
can be written as:

$\theta_i \mid \Theta, \gamma, Z_i, \sigma_\theta^2 \sim N(\Theta + \gamma Z_i, \sigma_\theta^2)$
$\alpha_i \mid \Lambda, \lambda, Z_i, \sigma_\alpha^2 \sim N(\Lambda + \lambda Z_i, \sigma_\alpha^2)$

with $\theta_i$ and $\alpha_i$ conditionally independent.

The coefficients $\gamma$ and $\lambda$ model systematic differences in positivity criteria and accuracy across
studies, due to the covariate $Z$. However, more general formulations of the model can be
considered in which more than one covariate is included and different covariates are used for
'cutpoint' and 'accuracy' regression equations.
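
Read generatively, Levels I and II describe how the data for one study could arise. The sketch below simulates a single study under that reading. It is an illustration only: disease status is assumed to be coded $+1/2$ for diseased and $-1/2$ for not diseased cases, the function name is hypothetical, and the default parameter values are arbitrary placeholders rather than estimates from any analysis.

```python
import math
import random


def expit(x):
    return 1 / (1 + math.exp(-x))


def simulate_study(n0, n1, z, rng,
                   Theta=-0.5, gamma=0.0,   # mean cutpoint and covariate effect
                   Lambda=2.5, lam=0.5,     # mean accuracy and covariate effect
                   beta=0.2,                # scale parameter, shared by studies
                   sd_theta=0.5, sd_alpha=0.5):
    """Draw one study's parameters (Level II) and test counts (Level I)."""
    # Level II: study-specific cutpoint and accuracy parameters.
    theta = rng.gauss(Theta + gamma * z, sd_theta)
    alpha = rng.gauss(Lambda + lam * z, sd_alpha)
    # Level I: binomial counts of positive tests in each disease group.
    counts = {}
    for label, x, n in (("false_pos", -0.5, n0), ("true_pos", 0.5, n1)):
        pi = expit((theta + alpha * x) * math.exp(-beta * x))
        counts[label] = sum(rng.random() < pi for _ in range(n))
    return theta, alpha, counts
```

Repeated calls with different covariate values `z` give a set of studies whose observed (TP, FP) pairs scatter around a common summary curve, with the spread governed by `sd_theta` and `sd_alpha`.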

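Level III. The specification of the hierarchical model is completed by the choice of prior
distributions for the remaining unknown parameters. In particular, we chose:

$\Theta \sim \mathrm{Uniform}[\mu_{\theta 1}, \mu_{\theta 2}]$;  $\gamma \sim \mathrm{Uniform}[\mu_{\gamma 1}, \mu_{\gamma 2}]$;  $\sigma_\theta^2 \sim \Gamma^{-1}(\xi_{\theta 1}, \xi_{\theta 2})$
$\Lambda \sim \mathrm{Uniform}[\mu_{\alpha 1}, \mu_{\alpha 2}]$;  $\lambda \sim \mathrm{Uniform}[\mu_{\lambda 1}, \mu_{\lambda 2}]$;  $\sigma_\alpha^2 \sim \Gamma^{-1}(\xi_{\alpha 1}, \xi_{\alpha 2})$
$\beta \sim \mathrm{Uniform}[\mu_{\beta 1}, \mu_{\beta 2}]$

The parameters $\Theta$, $\Lambda$, $\beta$, $\gamma$, $\lambda$, $\sigma_\theta^2$ and $\sigma_\alpha^2$ are assumed to be mutually independent. The
hyperparameters of these prior distributions are assumed to be fixed and are chosen to reflect expected
ranges.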

Summary ROC (SROC) curves can be derived using the expected values of $\Lambda + \lambda Z$ and $\beta$.
If true disease state is coded $+\frac{1}{2}$ for disease positive cases and $-\frac{1}{2}$ for not diseased cases, then for
a given value of the covariate $Z_i$, the model-based true positive rate can be expressed as:

$TP(FP) = \mathrm{logit}^{-1}\left[\left(\mathrm{logit}(FP)\, e^{-E(\beta)/2} + E(\Lambda + \lambda Z_i)\right) e^{-E(\beta)/2}\right]$

The SROC curve is drawn by plotting $(FP, TP(FP))$ for $FP \in [0, 1]$. Extrapolation beyond the
available  data  can  be  discouraged  by  plotting  curves  only  over  the  observed  range  of  FP. 
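
The SROC expression follows from equation (3) by solving the false positive equation for the cutpoint and substituting it into the true positive equation. The sketch below evaluates the curve over a restricted false positive range; the function names, the range, and the parameter values are illustrative assumptions.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


def sroc_tp(fp, accuracy, beta):
    """Model-based TP rate at a given FP rate.

    `accuracy` stands for the expected value of Lambda + lambda * Z and
    `beta` for the expected scale parameter.
    """
    half = math.exp(-beta / 2.0)
    return inv_logit((logit(fp) * half + accuracy) * half)


def sroc_curve(fp_min, fp_max, accuracy, beta, n_points=50):
    """Points (FP, TP(FP)) on an evenly spaced grid over [fp_min, fp_max]."""
    step = (fp_max - fp_min) / (n_points - 1)
    grid = [fp_min + k * step for k in range(n_points)]
    return [(fp, sroc_tp(fp, accuracy, beta)) for fp in grid]


# Hypothetical inputs: the curve is drawn only over an assumed observed FP range.
curve = sroc_curve(0.02, 0.30, accuracy=3.72, beta=-0.683)
print(curve[0], curve[-1])
```

Restricting `fp_min` and `fp_max` to the observed range is one simple way to avoid extrapolating the curve beyond the data.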

3.2.  HSROC  Model  fitting 

Inference  from  the  HSROC  model  is  based  on  the  posterior  distributions  of  model  parameters. 
Because the models we consider are not conjugate, closed form expressions for posterior
distributions do not exist. Posterior quantities are estimated by simulating observations from the posterior
distribution  using  Markov  Chain  Monte  Carlo  (MCMC)  simulation  [19].  These  simulated  values 
from  the  full  posterior  distribution  are  used  to  estimate  marginal  distributions  of  interest,  such 
as  posterior  distributions  of  particular  parameters  or  functions  of  parameters. 

To  enable  estimation,  each  covariate  was  centered  at  zero.  When  estimating  the  fixed  effect 
binomial  regression  model,  covariate  centering  is  required  for  model  identifiability  [16].  Under  the 
hierarchical  regression  model,  centering  the  covariate  vector  helps  to  reduce  correlation  between 
consecutive  draws. 

3.2.1  Conditional  distributions 

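The conditional distributions of Level II parameters ($\Theta$, $\Lambda$, $\sigma_\theta^2$, $\sigma_\alpha^2$, $\gamma$ and $\lambda$) are standard
distributions or truncated versions of standard distributions. For example, the conditional distribution of
$\Lambda$ given $\lambda$, $Z$, $\sigma_\alpha^2$, $\alpha$, and $\mu_\alpha$ is proportional to a Normal distribution with mean
$\sum_{i=1}^{m}(\alpha_i - \lambda Z_i)/m$ and variance $\sigma_\alpha^2/m$ over the range $[\mu_{\alpha 1}, \mu_{\alpha 2}]$. Variance parameters,
$\sigma_\alpha^2$ and $\sigma_\theta^2$, have conjugate priors, so that

$(\sigma_\alpha^2 \mid \Lambda, Z, \alpha, \lambda, \xi_\alpha) \sim \Gamma^{-1}\left(\xi_{\alpha 1} + m/2,\ \left(\frac{1}{2}\sum_{i=1}^{m}(\alpha_i - \Lambda - \lambda Z_i)^2 + 1/\xi_{\alpha 2}\right)^{-1}\right)$.  (4)

The conditional distributions of $\Theta$ and $\sigma_\theta^2$ are analogous. Finally, the conditional distribution
$(\lambda \mid \Lambda, \sigma_\alpha^2, \alpha, \mu_\lambda)$ is proportional to a normal distribution with mean

$\frac{\sum_{i=1}^{m} Z_i(\alpha_i - \Lambda)}{\sum_{i=1}^{m} Z_i^2}$

and variance $\sigma_\alpha^2 / \sum_{i=1}^{m} Z_i^2$ over the range $[\mu_{\lambda 1}, \mu_{\lambda 2}]$.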

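The conditional distributions of Level I parameters ($\theta_i$, $\alpha_i$ and $\beta$) are not standard. The
conditional distribution of study specific accuracies, $(\alpha_i \mid y_i, n_i, Z_i, \Lambda, \sigma_\alpha^2, \theta_i, \beta)$, is a Binomial-
Normal product, proportional to

$\prod_{j=0}^{1} \pi_{ij}^{y_{ij}} (1 - \pi_{ij})^{n_{ij} - y_{ij}} \exp\left\{-\frac{(\alpha_i - \Lambda - \lambda Z_i)^2}{2\sigma_\alpha^2}\right\}$

The conditional distribution of $\theta_i$ has similar form. The conditional distribution of the scale
parameter, $(\beta \mid y, n, Z, \theta, \alpha, \mu_\beta)$, is proportional to the product of 2m Binomials,

$\prod_{i=1}^{m} \prod_{j=0}^{1} \binom{n_{ij}}{y_{ij}} \pi_{ij}^{y_{ij}} (1 - \pi_{ij})^{n_{ij} - y_{ij}}$

with positive probability over the range $[\mu_{\beta 1}, \mu_{\beta 2}]$.
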
3.2.2  Metropolis  steps 

We simulated draws from the nonstandard conditional distributions of $\beta$, $\theta_i$ and $\alpha_i$ using an
adaptive Metropolis step [20]. The Metropolis algorithm works by simulating candidate draws
from a jumping distribution. Candidate draws that increase the posterior density are always accepted;
other candidates are accepted with probability equal to the ratio of the candidate and current
posterior densities, and otherwise the chain stays at the current draw. The scale parameter ($\beta$) was
sampled using a univariate Normal jumping distribution. Study-level parameters ($\theta_i$, $\alpha_i$) were jointly
sampled using a bivariate Normal jumping distribution. Normal variances and variance-covariance matrices
were calculated using a multiple of the inverse information matrix evaluated at the current values of
unknown parameters. Because low rejection probabilities can indicate that the jumping distribution
is underdispersed relative to the target distribution, covariance matrices were inflated so
that rejection rates varied between 20% and 40% for all parameters [21].
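
A univariate random-walk Metropolis update with a Normal jumping distribution can be sketched as follows. This is a generic illustration, not the adaptive sampler used in the analysis: the toy target (a standard Normal density restricted to [-5, 5]), the jump standard deviation, and all names are assumptions made for the example.

```python
import math
import random


def metropolis(log_density, start, jump_sd, n_draws, rng):
    """Random-walk Metropolis sampler for a scalar parameter.

    Returns the list of draws and the observed rejection rate.
    """
    draws, rejected = [], 0
    current, current_lp = start, log_density(start)
    for _ in range(n_draws):
        candidate = rng.gauss(current, jump_sd)
        candidate_lp = log_density(candidate)
        # Accept with probability min(1, density ratio); otherwise stay put.
        accept_prob = math.exp(min(0.0, candidate_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = candidate, candidate_lp
        else:
            rejected += 1
        draws.append(current)
    return draws, rejected / n_draws


def toy_log_density(x):
    """Hypothetical target: standard Normal kernel with support limited to [-5, 5]."""
    if not -5.0 <= x <= 5.0:
        return float("-inf")
    return -0.5 * x * x


rng = random.Random(0)
draws, rejection_rate = metropolis(toy_log_density, 0.0, 2.4, 20000, rng)
print(sum(draws) / len(draws), rejection_rate)
```

An adaptive version would periodically rescale `jump_sd` (or the covariance matrix in the bivariate case) until the observed rejection rate falls in the desired band.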

3.2.3  Choice  of  priors 

Prior ranges for $\Theta$, $\Lambda$ and $\beta$ should be chosen to reflect subject matter knowledge about the
diagnostic modalities under review. In general, the interval [-10, 10] covers all reasonable values
of $\Theta$. Similarly, the interval [-5, 5] covers all reasonable values of $\beta$. Because we expect positive
test results, indicating disease, to be more common among patients with disease, the interval
[-2, 20] covers all reasonable values of $\Lambda$.

Selection of the Inverse Gamma priors for the between-study variance parameters, $\sigma_\theta^2$ and $\sigma_\alpha^2$,
is more difficult. The goal in making this choice is to select a relatively diffuse distribution, which
nevertheless does not assign much probability to very large values of the variance parameters.
For example, although a $\Gamma^{-1}(0.1, 1)$ is quite diffuse, it assigns unduly large weight to large values
of $\sigma^2$; the lower quartile of the $\Gamma^{-1}(0.1, 1)$ distribution is 28.35. We chose a $\Gamma^{-1}(1, 2)$ prior for
variance parameters because this covers the expected range of variability in these data. Quartiles
of the $\Gamma^{-1}(1, 2)$ distribution are 1.44, 2.89, and 6.96. The probability that a $\Gamma^{-1}(1, 2)$ random
variable is greater than 9 is 0.20.
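
The quoted summaries of the $\Gamma^{-1}(1, 2)$ prior can be checked in closed form. The sketch below assumes that $\Gamma^{-1}(a, b)$ denotes the distribution of X when 1/X follows a Gamma distribution with shape a and rate b; under that assumption, shape 1 reduces to an exponential distribution for 1/X. The function name is hypothetical.

```python
import math


def inv_gamma_shape1_quantile(p, scale):
    """Quantile of X when 1/X ~ Exponential(rate=scale).

    For this case P(X <= x) = exp(-scale / x), so the quantile is -scale / log(p).
    """
    return -scale / math.log(p)


quartiles = [inv_gamma_shape1_quantile(p, 2.0) for p in (0.25, 0.50, 0.75)]
prob_above_9 = 1.0 - math.exp(-2.0 / 9.0)

print([round(q, 2) for q in quartiles], round(prob_above_9, 2))
```

Under this parameterization the computed values agree with the reported quartiles and tail probability to within rounding.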

3.2.4 Parameter estimation

The goal of estimation is description of the posterior distribution of
model parameters and summary statistics that are functions of model parameters. Posterior 95%
credible intervals were estimated by empirical 2.5% and 97.5% posterior percentiles of simulated
draws. The mode of symmetric posterior distributions was approximated by the mean value
across simulated draws (i.e., $\Theta$, $\Lambda$, and $\beta$). The mode of asymmetric posterior distributions was
approximated by the median value across simulated draws (i.e., $\sigma_\theta^2$ and $\sigma_\alpha^2$).
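
These posterior summaries are simple functions of the saved draws. The following is a minimal sketch, assuming linear interpolation between order statistics for the empirical percentiles (the text does not state which percentile rule was used); the function names and the toy draws are illustrative.

```python
import statistics


def percentile(sorted_draws, q):
    """Empirical percentile (q in [0, 100]) by linear interpolation."""
    pos = (len(sorted_draws) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    return sorted_draws[lo] + (pos - lo) * (sorted_draws[hi] - sorted_draws[lo])


def summarize(draws, symmetric=True):
    """Point estimate (mean or median) and 95% credible interval from draws."""
    ordered = sorted(draws)
    centre = statistics.mean(ordered) if symmetric else statistics.median(ordered)
    return centre, (percentile(ordered, 2.5), percentile(ordered, 97.5))


# Toy draws standing in for saved MCMC output.
toy_draws = [k / 1000 for k in range(1001)]
estimate, interval = summarize(toy_draws)
print(estimate, interval)
```

Setting `symmetric=False` switches the point estimate to the median, as described for the variance parameters.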

3.2.5  Assessing  convergence 

Estimation  was  based  on  draws  from  several  chains  started  at  extreme  points  in  the  parameter 
space. The CODA program [22] was used to evaluate the convergence to the target distribution.
We relied primarily on examination of trace plots and estimates of scale reduction proposed by
Gelman and Rubin [21]. The scale reduction statistic is essentially based on the ratio of the between chain
variance to the within chain variance; values close to 1 indicate that the chains have mixed.
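
For a single scalar parameter, a basic form of the scale reduction statistic can be computed as sketched below. This version omits the degrees-of-freedom correction that some implementations apply, so it should be read as an approximation to what CODA reports; the function name and the toy chains are assumptions.

```python
import statistics


def scale_reduction(chains):
    """Basic potential scale reduction for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.mean(chain) for chain in chains]
    within = statistics.mean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance from both sources of variation.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Toy chains: the first pair overlaps, the second pair is far apart.
mixed = scale_reduction([[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]])
unmixed = scale_reduction([[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]])
print(mixed, unmixed)
```

Chains started at overdispersed points that have not yet mixed give values well above 1, as in the second toy example.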

3.2.6  Diagnostics 

Diagnostic statistics were used to evaluate possible model misspecification and overall goodness of
fit, and to identify outlying and possibly influential data points. Our approach roughly follows
the suggestions of Weiss [23].

Checks for model misspecification were restricted to evaluation of the prior distributions for
study-specific parameters $\theta$ and $\alpha$. Recall that we assume that both parameters are normally
distributed. Under the exchangeable model (i.e. $\gamma = \lambda = 0$), the sums of squares
$S_\theta = \sum_i (\theta_i - \Theta)^2/\sigma_\theta^2$ and $S_\alpha = \sum_i (\alpha_i - \Lambda)^2/\sigma_\alpha^2$ should follow a $\chi^2$ distribution with m degrees of
freedom, where m is the number of studies in the meta-analysis. Under the nonexchangeable
model sums of squares are of the form $S_\alpha = \sum_i (\alpha_i - \Lambda - \lambda Z_i)^2/\sigma_\alpha^2$. Because large values
of tail probabilities suggest misspecification of prior distributions, we estimate
$p(\alpha) = P(S_\alpha < \chi^2_{m,0.025}) + P(S_\alpha > \chi^2_{m,0.975})$ and $p(\theta) = P(S_\theta < \chi^2_{m,0.025}) + P(S_\theta > \chi^2_{m,0.975})$ to evaluate these
priors.


Global goodness of fit can be evaluated using two chi-square discrepancy statistics. The first
is based on estimated counts:

$R^2_{count} = \sum_{i} \sum_{j} \frac{(y_{ij} - E(y_{ij} \mid \text{model, data}))^2}{E(y_{ij} \mid \text{model, data})}$

where $y_{ij}$ is the number of subjects testing positive in not-diseased (j = 0) and diseased (j = 1)
groups. $R^2_{count}$ is compared to a $\chi^2_{2m}$ distribution. The second global goodness of fit statistic
is based on continuity-corrected log-odds ratios:

$R^2_{OR} = \sum_{i} \frac{(\log(OR_{cc})_i - E(\log(OR_{cc})_i \mid \text{model, data}))^2}{\mathrm{var}(\log(OR_{cc})_i \mid \text{model, data})}$

where $\log(OR_{cc})_i$ is the observed continuity corrected log-odds ratio for the i-th study.

Outliers and potentially influential points were identified using plots of sensitivity versus
specificity and by examining chi-square residuals, e.g., $(y_{ij} - E(y_{ij} \mid \text{model, data}))^2 / E(y_{ij} \mid \text{model, data})$.
The  sensitivity  of  the  model  to  potentially  outlying  and  influential  points  can  be  examined  by 
removing  these  points  and  re-estimating  parameters. 
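
The count-based residuals and their sum can be computed directly once model-based expected counts are available. In the sketch below the observed and expected counts are made-up numbers, and the function names are hypothetical; in practice the expected counts would come from the posterior draws.

```python
def chi_square_residuals(observed, expected):
    """Chi-square residual (y - E)^2 / E for each cell."""
    return [(y - e) ** 2 / e for y, e in zip(observed, expected)]


def count_discrepancy(observed, expected):
    """Sum of chi-square residuals over all cells."""
    return sum(chi_square_residuals(observed, expected))


# Hypothetical cells: positive-test counts and their model-based expectations.
observed_counts = [12, 30, 5, 41]
expected_counts = [10.0, 32.0, 5.0, 40.0]

residuals = chi_square_residuals(observed_counts, expected_counts)
largest_cell = max(range(len(residuals)), key=residuals.__getitem__)
print(residuals, count_discrepancy(observed_counts, expected_counts), largest_cell)
```

Cells with unusually large residuals are natural candidates for the sensitivity analysis described above.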

4.  Example:  Evaluation  of  Lymph  Node  Metastases 
4.1.  Data 

To  demonstrate  the  hierarchical  model,  we  reanalysed  data  from  a  published  meta-analysis  of 
diagnostic  imaging  tests  used  to  detect  lymph  node  metastasis  in  patients  with  cervical  cancer 
[14].  This  study  compared  three  tests  for  detection  of  lymph  node  metastasis:  lymphangiography 
(LAG),  computed  tomography  (CT),  and  magnetic  resonance  imaging  (MR).  Data  were  combined 
from 37 studies, of which 17 examined LAG, 19 examined CT and 10 examined MR. Nine studies
examined more than one test. Observed true positive and false positive rates are shown in
Figure 1.


[  Figure  1  about  here  ] 


The original analysis by Scheidler and colleagues used a fixed effect SROC [4], Q* statistics,
and likelihood ratio statistics to compare tests. The Q* statistic corresponds to the estimated
true positive rate at the point on the SROC curve where the sensitivity is equal to the specificity
of the test. SROC curves were estimated separately for each test type. In addition, likelihood
ratio statistics were used to describe the post-test change in odds of disease. The likelihood
ratio negative ($LR^{-}$) estimates the multiplicative change in the odds of disease given a negative test.
The likelihood ratio positive ($LR^{+}$) estimates the multiplicative change in the odds of disease given a
positive test. Scheidler et al. concluded that the three tests had similar diagnostic performance.
Although the analysis did not
find  statistically  significant  differences  among  the  modalities,  the  authors  noted  that  MR  seemed 
to  perform  somewhat  better  than  CT  or  LAG. 

4.2  HSROC  Computations 

We used HSROC analysis to derive summaries of the diagnostic performance of the three
modalities, allowing different expected cutpoint ($\Theta$) and accuracy ($\Lambda$) parameters for each modality.
Because the shapes of the three SROC curves looked different we allowed separate scale
parameters ($\beta$) for each test. Because the spread of the observed points around the three SROC curves
seemed to differ across modalities, we allowed separate variance parameters ($\sigma_\theta^2$, $\sigma_\alpha^2$) for each modality.
In these analyses, we coded disease state as $+\frac{1}{2}$ for disease positive cases and $-\frac{1}{2}$ for disease
negative cases.

The  model  was  estimated  using  the  combined  set  of  data  from  all  modalities  but  did  not 
explicitly  include  correlation  terms  for  data  derived  from  studies  that  compared  two  or  all  three 
of  the  modalities.  In  particular,  2  studies  examined  CT  and  LAG,  4  studies  examined  CT  and 
MR,  and  2  studies  examined  CT  twice.  Although  it  is  possible  to  extend  the  model  to  cover 
such  correlations,  the  cross-tabulated  data  from  studies  evaluating  more  than  one  modality  were 
not  available.  Because  we  expect  positive  correlation  between  diagnostic  test  results,  we  expect 
that  ignoring  this  correlation  could  cause  a  slight  conservative  bias  in  comparisons  between  tests. 
Results  from  the  combined  analysis  were  compared  to  those  from  analyses  conducted  separately 
for  each  modality. 

The sampler was run using 8 independent MCMC chains. Experimental runs showed that
the sampler mixed slowly. To ensure coverage of the target distribution, estimation was
based on multiple sequences with overdispersed starting points. Because of high between-draw
correlation, every 50th iteration was saved from each sequence of 100,000 simulated draws.
Metropolis covariance parameters were updated every 1,000 iterations to maintain rejection rates
between 20% and 40%. Eight different chains were run, with starting points based on $\beta$, $\Theta$
and $\Lambda$: $\beta^{(0)} \in \{-2.5, 2.5\}$, $\Theta^{(0)} \in \{-5, 5\}$, and $\Lambda^{(0)} \in \{-1, 10\}$. Starting values for the prior
variability of study specific cutpoints ($\sigma_\theta^2$) and accuracies ($\sigma_\alpha^2$) were set to 9, nearly half of what
we believe is a reasonable range for $\theta_i$ and $\alpha_i$. All parameters had estimated scale reduction
statistics [20] that were between 1.00 and 1.09 and most were between 1.00 and 1.01, indicating
that the sampler had converged. All saved iterations were used to evaluate convergence, but the
first 5,000 were excluded from point and interval estimation.

Summary  statistics  were  calculated  for  each  draw  of  the  sampler,  and  these  values  were  used 
to  estimate  their  posterior  modes  and  95%  credible  intervals.  The  summary  statistics  we  used 
were  the  overall  likelihood  ratio  positive,  overall  likelihood  ratio  negative,  overall  TP  rate,  and 
overall  FP  rate.  These  overall  performance  estimates  were  calculated  for  each  test.  Statistics 
describing  overall  test  performance  were  functions  of  parameters  describing  performance  across 
studies: the level I model parameter $\beta$ and level II model parameters describing expected values
of study-level parameters. For example, overall TP and FP rates for CT are estimated from:

$TP_{CT} = \mathrm{logit}^{-1}\left[(\Theta_{CT} + \Lambda_{CT}/2)\, e^{-\beta_{CT}/2}\right]$

$FP_{CT} = \mathrm{logit}^{-1}\left[(\Theta_{CT} - \Lambda_{CT}/2)\, e^{\beta_{CT}/2}\right]$

The ($TP_{CT}$, $FP_{CT}$) pair summarizes the overall performance of CT. Likelihood ratio statistics
were estimated using overall TP and FP for each test, for example, $LR^{+}_{CT} = TP_{CT}/FP_{CT}$ and
$LR^{-}_{CT} = (1 - TP_{CT})/(1 - FP_{CT})$.
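
As a worked illustration of these formulas, the sketch below plugs the CT point estimates listed in Table 1 (reading the CT column as $\Theta$ = -1.900, $\Lambda$ = 3.720 and $\beta$ = -0.683) into the expressions above. This is only an approximation to the reported analysis, which computed the summaries for each draw of the sampler rather than from point estimates; the function names are hypothetical.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


def overall_rates(big_theta, big_lambda, beta):
    """Overall TP and FP rates with disease status coded +1/2 and -1/2."""
    tp = inv_logit((big_theta + big_lambda / 2.0) * math.exp(-beta / 2.0))
    fp = inv_logit((big_theta - big_lambda / 2.0) * math.exp(beta / 2.0))
    return tp, fp


def likelihood_ratios(tp, fp):
    """Likelihood ratio positive and likelihood ratio negative."""
    return tp / fp, (1.0 - tp) / (1.0 - fp)


tp_ct, fp_ct = overall_rates(big_theta=-1.900, big_lambda=3.720, beta=-0.683)
lr_pos_ct, lr_neg_ct = likelihood_ratios(tp_ct, fp_ct)
print(round(tp_ct, 3), round(fp_ct, 3), round(lr_pos_ct, 2), round(lr_neg_ct, 3))
```

The plug-in values are close to, but not identical with, the CT row of Table 2, as expected when a nonlinear function is applied to point estimates instead of being summarized over draws.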

Probability  estimates  were  based  on  the  overall  proportion  of  times  a  statement  was  true. 
These  estimates  also  exclude  the  first  5,000  iterations. 
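
Estimating such a probability amounts to counting how often a statement holds across the retained draws. A minimal sketch follows, with made-up draws and a hypothetical function name; `burn_in` counts discarded leading draws.

```python
def posterior_probability(draws_a, draws_b, burn_in=0):
    """Proportion of retained paired draws in which a is smaller than b."""
    pairs = list(zip(draws_a, draws_b))[burn_in:]
    return sum(a < b for a, b in pairs) / len(pairs)


# Hypothetical paired draws of a summary statistic for two tests.
lr_neg_test_a = [0.40, 0.50, 0.30, 0.60]
lr_neg_test_b = [0.50, 0.40, 0.60, 0.70]

prob_a_lower = posterior_probability(lr_neg_test_a, lr_neg_test_b, burn_in=1)
print(prob_a_lower)
```

The draws must be paired, that is, taken from the same iteration of the sampler, so that the comparison reflects the joint posterior distribution.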

4.3.  Results 

Table  1  shows  parameter  estimates  based  on  MCMC  estimation.  Although  95%  credible 
intervals for all three scale parameters included zero, there was evidence that the scale parameter
for LAG was different from the scale parameter for CT. The estimated mode of $\beta_{LAG} - \beta_{CT}$
was 1.34, with 95% credible interval (0.068, 2.71). The positivity criteria across studies of LAG
tended  to  be  less  variable  than  the  positivity  criteria  used  in  studies  of  both  CT  (with  estimated 
probability  0.910)  and  MR  (with  estimated  probability  0.960). 


[ Table 1 about here ]


Figure 2 shows estimated SROC curves based on estimated expected values of ($\Lambda_{LAG}$, $\beta_{LAG}$),
($\Lambda_{CT}$, $\beta_{CT}$), and ($\Lambda_{MR}$, $\beta_{MR}$). To avoid extrapolation beyond the data, curves are plotted over
the  observed  ranges  of  false  positive  rates. 


[ Figure 2 about here ]


Comparisons  based  on  overall  measures  of  test  performance  show  that  CT  and  MR  tended  to 
have  lower  expected  FP  and  TP  rates  than  LAG  (Table  2).  There  was  also  evidence  of  differences 
in  the  likelihood  ratio  positive  and  likelihood  ratio  negative  of  the  three  modalities  (Table  2).  The 
estimated probability that LAG had a better (lower) $LR^{-}$ than CT was 0.951. LAG also had a
lower $LR^{-}$ than MR with an estimated probability of 0.761. This is evidence that negative LAG
results are more informative than negative CT or MR results. On the other hand, the likelihood
ratio positive ($LR^{+}$) for LAG was worse (lower) than the $LR^{+}$ for CT with probability 0.866, and
worse than the $LR^{+}$ for MR with probability 0.976. This is evidence that positive CT or MR
results  are  more  informative  than  positive  LAG  results. 


[ Table 2 about here ]

4.4. Model checks


There  was  no  evidence  of  significant  lack  of  fit  in  the  HSROC  model.  Estimated  TP  and  FP  rates 
were close to observed values ($\chi^2_{74}$ = 23.45, p-value = 1.000), as were estimated log-odds ratios
($\chi^2_{37}$ = 7.71, p-value = 1.000). Chi-square degrees of freedom for goodness-of-fit statistics were
calculated using the number of independent studies. Normal distributions seemed to reasonably
approximate the distribution of cutpoint parameters (observed 5% tail probability = 4.41%) and
accuracy parameters (observed 5% tail probability = 4.44%).

None  of  the  study  results  were  identified  as  influential  based  on  chi-square  residuals.  Fitted 
plots  showed  two  points  with  outlying  FP  rates  for  LAG.  Results  from  analyses  that  excluded 
these  points  were  similar  to  results  based  on  the  full  data  set  and  therefore  these  studies  were 
retained  in  analyses. 

5.  Discussion 

The  hierarchical  summary  ROC  (HSROC)  model  for  combining  estimated  pairs  of  sensitivity  and 
specificity  from  multiple  studies  extends  the  currently  used  fixed-effects  summary  ROC  (SROC) 
model.  The  HSROC  model  describes  within-study  variability  using  a  binomial  distribution  for 
the  number  of  positive  tests  in  diseased  and  not  diseased  patients.  An  underlying  ROC  model 
that  allows  variability  in  both  the  positivity  criteria  and  accuracy  across  studies  determines  the 
binomial  probabilities.  Variation  in  positivity  criteria  and  accuracy  is  modelled  using  a  Normal 
distribution,  with  a  linear  regression  in  the  mean  that  allows  dependence  on  study-level  covariates. 
Heavier-tailed distributions (such as t or Cauchy) can also be used instead of a Normal in
the second level of the hierarchical model. As is commonly the case with hierarchical regression
models, the HSROC model allows more complete accounting of between-study variability than is
possible with fixed-effects formulations. In addition, the HSROC model provides more realistic
accounting of within-study variability than the original fixed-effects SROC model [4], which used a
Normal error distribution and did not account for the measurement error in the primary covariate.

The  HSROC  approach  provides  a  flexible  modelling  framework  that  can  be  extended  when 
more  information  is  available.  For  example,  when  studies  report  results  from  more  than  one 
modality, the hierarchical model can be appropriately extended to incorporate within-study
correlation. This extension requires information about the joint distribution of test results, either
from  multiple  similar  pairs  across  several  studies,  from  cross-tabulation  of  test  results  within 
studies,  or  from  patient-level  data  within  studies.  When  patient  level  information  is  available, 
the  within-study  (Level  I)  model  can  be  extended  to  incorporate  patient-level  covariates.  This 
extended  model  can  also  be  applied  to  data  from  a  single  study  when  results  are  clustered  within 
participating  institutions  and/or  readers  (see  [24]  for  a  hierarchical  analysis  of  ROC  data). 

The  fully  Bayesian  approach  to  model  fitting,  although  computationally  intensive,  leads  to 
simulated  values  from  the  posterior  distribution  of  the  parameters,  on  the  basis  of  which  the 
analyst  can  easily  calculate  summaries  of  the  posterior  distribution  of  a  broad  range  of  functions 
of the parameters. For example, in our reanalysis of the Scheidler data we derived estimates
of  functionals  of  the  posterior  distribution  of  likelihood  ratio  statistics,  and  differences  between 
likelihood  ratio  statistics  for  the  three  modalities.  On  the  basis  of  these  estimates  it  appears 
that  LAG  provides  different  clinical  information  than  CT  or  MR,  even  though  all  three  tests  had 
similar  overall  accuracy.  Bayesian  modelling  allowed  us  to  express  these  findings  via  probabilistic 
statements. Such probabilistic estimates may be easier to interpret than classical frequentist
summaries.


The  Bayesian  model  also  allows  description  of  sources  of  variability.  The  differences  we  found 
in  the  variability  of  positivity  criteria  were  consistent  with  the  technological  development  of  these 
three  tests.  At  one  extreme,  LAG  is  a  widely  used  standard  diagnostic  test  with  well  developed 
positivity  criteria.  At  the  other  extreme,  MR  was  a  new  diagnostic  approach  at  the  beginning  of 
the meta-analyzed time period, without accepted positivity criteria. The estimated variability
of cutpoint parameters was low for LAG. The variability of CT cutpoints was more than twice
the variability of LAG cutpoints, and the variability of MR cutpoints was more than four times
the variability of LAG cutpoints. This suggests that MR accuracy could be improved through the
definition  and  adoption  of  good  positivity  criteria. 

The  advantages  of  the  HSROC  model  come  at  a  price:  estimation  requires  Markov  Chain 
Monte Carlo (MCMC) simulation. MCMC estimation requires programming, simulation,
evaluation of convergence and model adequacy, and synthesis of simulation results. Programming
the MCMC simulation can be time consuming. Although some versions of the proposed model
can be fitted within publicly available software (BUGS [25]), the full analysis is elaborate and,
depending on the specific model under consideration, may require extensive programming. Even
if the burden of the programming task were eliminated, implementation of MCMC simulation would still
entail nontrivial analysis tasks, including evaluation of convergence and the adequacy of prior
distributions, and this requires some statistical expertise. However, the increased complexity of the
proposed  analysis  must  be  measured  against  the  advantages  from  the  approach,  including  more 
realistic  assumptions,  more  precise  description  of  the  impact  of  covariates,  and  greater  flexibility 
in choice of descriptive statistics.

Figure 1. Detection of lymph node metastases using lymphangiography (LAG), computed
tomography (CT) or magnetic resonance (MR) imaging: observed true positive (TP) and false
positive (FP) rates are from data reported across 37 studies that were originally meta-analyzed
by Scheidler and colleagues [14].


Figure  2.  Estimated  summary  receiver  operating  characteristic  curves  for  lymphangiography 
(LAG),  computed  tomography  (CT)  and  magnetic  resonance  (MR)  imaging,  based  on  hierarchical 
regression  modelling. 


Table 1. Hierarchical ROC parameter estimates: estimated posterior modes with 95% credible
intervals in parentheses, by test type.

parameter            LAG                       CT                         MR
$\Theta$             -0.178 (-0.783, 0.492)    -1.900 (-3.050, -1.090)    -1.800 (-3.210, -0.731)
$\sigma_\theta^2$    0.248 (0.090, 0.587)      0.653 (0.202, 1.570)       1.070 (0.264, 3.160)
$\Lambda$            2.360 (1.660, 3.120)      3.720 (2.290, 6.030)       3.920 (2.350, 6.330)
$\sigma_\alpha^2$    1.080 (0.271, 2.780)      0.395 (0.106, 1.150)       0.852 (0.148, 2.900)
$\beta$              0.658 (-0.273, 1.720)     -0.683 (-1.650, 0.125)     -0.350 (-1.430, 0.587)

Table 2. Overall rates and likelihood ratios: posterior modes with 95% credible intervals in
parentheses.

test type   false positive rate     true positive rate      likelihood ratio negative   likelihood ratio positive
LAG         0.141 (0.080, 0.220)    0.666 (0.578, 0.750)    0.389 (0.288, 0.502)        5.03 (2.96, 8.51)
CT          0.070 (0.044, 0.102)    0.486 (0.324, 0.643)    0.553 (0.390, 0.719)        7.15 (4.63, 10.60)
MR          0.047 (0.019, 0.090)    0.542 (0.298, 0.757)    0.480 (0.260, 0.729)        12.8 (5.91, 24.5)

Assessing Mammographers’ Accuracy:

A comparison of clinical and test performance

Abstract 

Direct  estimation  of  mammographers’  clinical  accuracy  requires  the  ability  to  capture  screening 
assessments and correctly identify which screened women have breast cancer. This clinical
information is often unavailable and when it is available its observational nature can cause analytic
problems. Problems with clinical data have led some researchers to evaluate mammographers
using a single set of films. Research based on these test film sets implicitly assumes a
correspondence between mammographers’ accuracy in the test setting and their accuracy in a clinical
setting.  However,  there  is  no  evidence  supporting  this  basic  assumption.  In  this  article  we  use 
hierarchical  models  and  data  from  27  mammographers  to  directly  compare  accuracy  estimated 
from  clinical  practice  data  to  accuracy  estimated  from  a  test  film  set.  We  found  no  evidence  of 
correlation  between  clinical  and  test  accuracy.  These  findings  raise  important  questions  about 
how  mammographer  accuracy  should  be  measured. 

keywords:  sensitivity,  specificity,  hierarchical  models,  mammography. 

running title: Assessing Mammographers’ Accuracy: clinical versus test performance

1. Introduction

Screening mammography is an effective method of detecting early stage breast cancer. However,
the diagnostic value of a mammogram depends on both the technical quality of the film and
a mammographer’s ability to interpret that film. In the last decade mammographic
technology has been relatively stable, allowing researchers to focus on the subjective interpretation of
mammograms (e.g., [1, 2]).

The Mammography Quality Standards Act recognized the effect of mammographers’
interpretations on screening assessments and encouraged medical audits of mammographers’ clinical
assessments. Evaluating mammographers’ performance using clinical assessments is intuitively
appealing, because this is ‘real life’ performance. For many researchers, the medical audit is
the gold standard measure of performance. [3] However, our ability to draw conclusions about the
performance of particular mammographers from these clinical assessments is limited because each
mammographer reviews a different set of films. The difficulty of films varies with characteristics
of the women evaluated (e.g., breast density), characteristics of lesions (e.g., size), and
characteristics of technical film quality (e.g., positioning). Variability in film difficulty results in chance
differences among mammographers. Systematic differences in the difficulty of films reviewed
can also occur, for example, when mammographers tend to send difficult cases to a particular
colleague. Differences in the number of films reviewed also affect comparisons between
mammographers through the variability of estimated performance. Because performance estimates based
on fewer patients tend to be more variable, and therefore more extreme, comparisons that ignore
differences in variability can be misleading. Statistical models have a limited ability to adjust for
differences in the films read by each mammographer. [4, 5]

Estimation  of  clinical  accuracy  is  further  complicated  by  the  influence  that  clinical  assessments 
have  on  the  probability  of  detecting  breast  cancer.  Screening  accuracy  estimation  focuses  on  the 
correspondence  between  a  mammographer’s  clinical  interpretation  and  a  woman’s  true  disease 
state.  Because  most  women  only  undergo  biopsy  if  a  mammographer  finds  an  abnormality, 
undetected  breast  cancer  cases  emerge  symptomatically  or  during  a  second  screening  exam.  Thus, 
undetected  breast  cancer  can  only  be  identified  when  follow-up  information  exists.  A  one  year 
follow-up is generally used, with women classified as disease positive at the time of a screening
mammogram if breast cancer is diagnosed within one year. [3]

Estimation  and  comparison  of  clinical  screening  performance  is  also  hampered  by  the  relatively 
low  incidence  of  breast  cancer.  The  one  year  incidence  of  invasive  breast  cancer  is  approximately 
3.5 per 1,000 among American women who are over 49 years old. [6] Low incidence rates make it
difficult to precisely estimate a mammographer’s rate of cancer detection, since most
mammographers will evaluate very few cancers in a single year.

Standardized testing of mammographers is an alternative way to estimate their accuracy.
Using standardized film sets removes many of the problems with clinical data. Each mammographer
views the same films in the same setting and with the same patient information. Test sets exclude
films from women without necessary follow-up information, so that true disease state is known
with a high degree of certainty. Test sets can also include more films from women with breast
cancer than would be seen in clinical practice, allowing more precise estimation of sensitivity. In
summary, use of a test film set controls for film difficulty, film quality and the information
presented during film evaluation, offering a relatively simple method of estimating mammographers’
accuracy  under  standardized  conditions. 

Although  estimating  accuracy  from  assessments  of  standardized  film  sets  avoids  many  of  the 
problems  with  clinical  data,  the  artificial  conditions  introduce  other  problems.  Mammographers 
know  that  in  the  test  setting  their  decisions  will  not  affect  patient  care.  The  test  itself  may  be 
burdensome  given  time  constraints.  There  is  also  evidence  suggesting  that  the  higher  prevalence 
of  disease  in  test  film  sets  introduces  bias.  Egglin[7]  found  that  radiologists  were  more  likely  to 
interpret  arteriograms  as  positive  for  pulmonary  emboli  when  viewed  in  a  higher  prevalence  film 
set,  regardless  of  true  disease  state.  When  this  ‘context  bias’  exists,  sensitivity  increases  with 
increasing  prevalence  while  specificity  decreases. 

Studies describing mammographer variability based on test film sets (e.g., [1, 2]) implicitly
assume  a  strong  correlation  between  mammographers’  performance  estimated  from  test  sets  and 
mammographers’  performance  in  clinical  practice.  However,  this  assumption  has  never  been 
tested.  In  this  article  we  directly  compare  mammographers’  clinical  and  test  performance. 

2.  Data 

We  analyzed  data  from  27  mammographers  practicing  at  a  large  staff  model  not-for-profit  health 
maintenance organization (HMO). The mammographers included in this study were voluntary
participants,  though  this  group  essentially  included  all  of  the  mammographers  practicing  with  the 
HMO  at  the  time  of  the  study. 

Both  clinical  and  test  data  sets  use  films  from  women  who  remained  enrolled  in  the  HMO 
for  at  least  two  years  after  their  index  mammogram.  Women  with  breast  cancer  were  identified 
using the regional Surveillance, Epidemiology, and End Results (SEER) registry. [8] Our reference standard
for true disease state called a woman ‘disease positive’ at the time of her screening mammogram
if either invasive cancer or ductal carcinoma in situ was detected within the following two years.
We  used  a  two  year  definition  because  routine  follow-up  care  included  mammographic  follow-up 
at  either  one  year  or  two  year  intervals,  depending  on  a  woman’s  particular  risk  factors  for  breast 
cancer. 

Clinical  Data:  Clinical  data  used  mammographers’  final  interpretations  and  recommendations 
based on mammograms from asymptomatic women screened from 1990 through 1994. Mammographers’
interpretations and recommendations have been collected as part of clinical practice
for every mammogram evaluated since 1986, using standardized data collection forms. During
the time period we examined, mammographer interpretations could be coded as ‘negative’,
‘inconclusive’, or ‘positive’. Final interpretations and recommendations were combined and coded
into  one  of  five  possible  clinical  assessments:  1)  negative  mammogram  and  recommendation  for 
mammographic  follow  up  at  1  year  or  later;  2)  inconclusive  mammogram  and  recommendation 
for mammographic follow up at 1 year or later; 3) inconclusive mammogram and recommendation
for follow up in less than 1 year (short interval follow-up); 4) inconclusive mammogram and
recommendation  for  biopsy  or  surgical  referral;  and  5)  positive  mammogram. 

Test  Data:  Mammographers  were  evaluated  using  test  film  sets  during  late  1994  and  early 
1995.  As  part  of  an  educational  intervention,  each  mammographer  assessed  the  same  set  of 
screening  mammograms.  Test  mammograms  were  drawn  from  the  population  of  women  screened 
between  1985  and  1991,  using  stratified  random  sampling.  Most  (92.5%,  111/120)  films  were 
selected  from  the  1990/1991  time  period.  Films  were  stratified  by  the  woman’s  true  disease 
state  and  the  original  (clinical)  mammographer’s  assessment.  We  defined  recommendations  for 
short interval follow-up, request for additional work-up, referral to biopsy, and positive mammogram
interpretations as positive mammographic assessments, corresponding to clinical assessment
categories 3, 4 and 5. Based on each screened woman’s true state and dichotomous clinical assessment,
we created four strata: true positive (TP) films (positive assessment, breast cancer
within  one  year);  false  negative  (FN)  films  (negative  assessment,  breast  cancer  within  one  year); 
true  negative  (TN)  films  (negative  assessment,  no  breast  cancer);  and  false  positive  (FP)  films 
(positive  assessment,  no  breast  cancer).  From  these  strata,  we  randomly  selected  23  TP  films, 
9  FN  films,  72  TN  films,  and  16  FP  films.  Because  of  the  stratified  sampling  scheme,  the  test 
film  set  was  not  representative  of  the  mix  of  films  seen  in  clinical  practice:  it  included  an  excess 
of films from women with breast cancer and films that originally led to incorrect assessments.
Out  of  these  120  films,  7  films  (3  TP  films  and  4  FP  films)  were  excluded  from  analyses  because 
marks  were  placed  on  films  during  the  course  of  the  study.  To  allow  correspondence  with  the 
clinical  analyses,  the  reference  standard  for  test  films  was  recalculated,  using  a  2  year  follow-up 
period. Applying the 2 year follow-up caused one TN film to be recoded as an FN. Within the 113
test  mammograms  used  for  analyses,  original  readers  were  67%  sensitive  and  86%  specific.  The 
average  age  of  screened  women  who  contributed  films  to  the  test  set  was  50  years,  ranging  from 
40  to  87  years. 
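The film counts in this paragraph can be checked with simple arithmetic. The sketch below uses variable names chosen here for illustration; it applies the stated exclusions and the single recoded film to the sampled strata and recomputes the original readers' sensitivity and specificity.

```python
# Strata sizes as sampled (counts taken from the text).
strata = {"TP": 23, "FN": 9, "TN": 72, "FP": 16}
assert sum(strata.values()) == 120

# Seven films were excluded because marks were placed on them.
strata["TP"] -= 3
strata["FP"] -= 4

# The two year reference standard moved one film from TN to FN.
strata["TN"] -= 1
strata["FN"] += 1

n_films = sum(strata.values())
n_cancer = strata["TP"] + strata["FN"]
n_no_cancer = strata["TN"] + strata["FP"]

# Accuracy of the original (clinical) readers on the retained films.
sensitivity = strata["TP"] / n_cancer
specificity = strata["TN"] / n_no_cancer

print(n_films, n_cancer, n_no_cancer)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Under this bookkeeping the test set holds 30 films from women with cancer and 83 from women without, which agrees with the per-reader denominators reported in Table 1.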

Mammograms  were  displayed  at  each  participating  mammography  clinic  in  a  dedicated  reading 
room. Films were displayed in four sets of 30, and each set was displayed for 2 weeks. Mammographers
scheduled a time to review films and were given 1 hour to read each set of 30 films. Each
‘film’  included  a  two-view  mammogram,  representing  a  single  screening  event,  and  the  woman’s 
most  recent  prior  two-view  screening  mammogram.  Prior  mammograms  were  unavailable  for 
43  women  (38%).  No  additional  clinical  information  was  provided,  and  mammographers  were 
not  provided  with  the  disease  prevalence  in  the  test  set.  Mammographers  provided  one  rating 
for  each  breast,  using  standardized  data  collection  forms.  The  5  possible  screening  assessments 
were:  1)  negative  or  benign;  2)  probably  benign  (short  interval  follow-up  needed);  3)  possibly 
abnormal  (additional  views  needed);  4)  suspicious  abnormality  (biopsy  should  be  considered);  and 
5)  highly  suggestive  of  malignancy.  Each  mammographer  provided  data  that  was  at  least  98% 
complete  (222/226  ratings)  and  15  of  the  27  mammographers  provided  complete  data.  There 
were  no  apparent  patterns  of  missing  data  between  mammographers.  These  breast-level  ratings 
were recoded as woman-level assessments. If the woman was diagnosed with breast cancer within
two  years  of  the  mammogram,  then  the  rating  given  to  the  breast  with  disease  was  used  in  the 
analyses.  If  the  woman  did  not  develop  cancer  in  the  following  two  years,  then  the  maximum  of 
the  two  breast  ratings  was  used. 
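The recoding rule can be written as a small function. This is a minimal sketch: the function name and the `"left"`/`"right"`/`None` encoding of the diseased breast are hypothetical, and the text does not say how bilateral cancers or missing ratings were handled, so neither is covered here.

```python
def woman_level_rating(left, right, diseased_breast=None):
    """Collapse two breast-level ratings (1-5) into one woman-level rating.

    diseased_breast is "left" or "right" when cancer was diagnosed within
    two years of the mammogram, and None otherwise (illustrative encoding).
    """
    if diseased_breast == "left":
        return left
    if diseased_breast == "right":
        return right
    if diseased_breast is None:
        # No cancer within two years: use the more suspicious rating.
        return max(left, right)
    raise ValueError("diseased_breast must be 'left', 'right', or None")


print(woman_level_rating(1, 4, "left"))   # rating of the diseased breast
print(woman_level_rating(1, 4, None))     # maximum of the two ratings
```

Using the diseased breast's rating means a high rating on the healthy breast is not counted as a detection of the cancer.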

3.  Methods 

We are primarily interested in the degree of correlation between mammographers’ accuracy measured
in a clinical setting and accuracy measured in a test setting. The accuracy measures we
focused on are sensitivity and specificity. Sensitivity is the proportion of women with breast cancer
who had a positive mammogram assessment. Specificity is the proportion of women without
breast  cancer  who  had  a  negative  mammogram  assessment. 

Calculation  of  sensitivity  and  specificity  requires  definition  of  a  positive  assessment.  For 
clinical  assessments,  we  defined  ratings  3,  4  and  5  as  positive  mammograms,  corresponding  to 
recommendations  for  short  interval  follow-up  or  biopsy.  Unfortunately,  test  assessments  do  not 
completely  match  clinical  assessments.  This  is  partly  because  clinical  assessments  were  based  on 
final  recommendations  whereas  the  test  scale  included  a  recommendation  for  additional  views. 
Clinical  data  did  not  include  recommendations  for  additional  views  because  this  is  an  intermediate 
clinical recommendation, with final recommendations based on these additional views. Given the
difference between these two measurement scales, we defined a positive outcome in the test set as a
recommendation for short interval follow-up, additional views, or biopsy,
corresponding to ratings 2, 3, 4, or 5. Mammographers’ ratings of test films were based on an
explicitly  ordinal  scale  that  defined  a  recommendation  for  additional  films  (possibly  abnormal)  as 
more  strongly  indicative  of  disease  than  a  recommendation  for  short  interval  follow-up  (probably 
benign). 
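The two dichotomization rules differ only in the cutoff applied to the five-point scale. The sketch below, with hypothetical names and made-up ratings, applies the cutoffs stated in the text (3 or higher for clinical assessments, 2 or higher for test ratings) and computes sensitivity and specificity from the resulting 2x2 counts.

```python
# Lowest rating counted as a positive assessment, per the text.
POSITIVE_FROM = {"clinical": 3, "test": 2}


def sensitivity_specificity(ratings, has_cancer, source):
    """Return (sensitivity, specificity) for ordinal ratings on a 1-5 scale."""
    threshold = POSITIVE_FROM[source]
    tp = fn = tn = fp = 0
    for rating, diseased in zip(ratings, has_cancer):
        positive = rating >= threshold
        if diseased:
            tp += positive
            fn += not positive
        else:
            fp += positive
            tn += not positive
    return tp / (tp + fn), tn / (tn + fp)


# Made-up ratings for six films, for illustration only.
ratings = [1, 2, 3, 5, 1, 2]
has_cancer = [False, False, True, True, True, False]
print(sensitivity_specificity(ratings, has_cancer, "test"))
print(sensitivity_specificity(ratings, has_cancer, "clinical"))
```

With identical ratings, the lower test cutoff can only raise sensitivity or leave it unchanged, and can only lower specificity or leave it unchanged.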

3.1  Statistical  Model 

We  used  a  hierarchical  model  to  describe  mammographers’  test  and  clinical  performance  measures, 
and  to  examine  relationships  between  these  measures.  Each  mammographer  contributed  data 
from two 2x2 tables, showing the overall agreement between their assessments and women’s
disease  state.  We  use  the  following  notation: 

                     Mammographic Interpretation
Breast Cancer        negative        positive
no                   $y_{ij00}$      $y_{ij01}$
yes                  $y_{ij10}$      $y_{ij11}$

where $i$ indicates mammographer and $j = 1, 2$ indicates the data source (1 = test
and 2 = clinical).

The model we use accounts for within mammographer variability in estimated sensitivity and
specificity by modeling the number of positive assessments each mammographer gave to diseased
($y_{ij11}$) and not-diseased ($y_{ij01}$) women with Binomial($n_{ij1}, \pi_{ij1}$) and Binomial($n_{ij0}, \pi_{ij0}$)
distributions. By using the observed sample sizes in Binomial distributions for each mammographer
and data set, the model accounts for differences in the amount of data available. The binomial
probability of a positive test is based on receiver operating characteristic models, [9] and is given
by:

$$\pi_{ijk} = \mathrm{logit}^{-1}(\theta_{ij} + \alpha_{ij} D_{ijk})$$

If $D_{ijk}$ was coded 0 for disease negative films and 1 for disease positive films, then under this
model mammographer $i$ evaluates data set $j$ with specificity equal to $1 - \mathrm{logit}^{-1}(\theta_{ij})$
and sensitivity equal to $\mathrm{logit}^{-1}(\theta_{ij} + \alpha_{ij})$. It is simpler to explain the interpretation of $\theta_{ij}$ and
$\alpha_{ij}$ in terms of false positive rates (equal to 1 - specificity) and true positive rates (equal to
the sensitivity). The parameter $\theta_{ij}$ captures the mammographer’s overall tendency to give
positive assessments, so that true positive rates increase with increasing false positive rates. The
parameter $\alpha_{ij}$ captures the difference between true positive and false positive rates and measures
the log-odds ratio of a positive test for films with breast cancer relative to films without breast
cancer. As in the ROC context, we call $\theta_{ij}$ “cutpoint parameters” and $\alpha_{ij}$ “accuracy parameters”.
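The mapping between the two parameters and an observed operating point can be made concrete. This is a minimal sketch under the 0/1 coding of the disease indicator described in the text; the function names are chosen here for illustration.

```python
import math


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def operating_point(theta, alpha):
    """Return (sensitivity, specificity) implied by cutpoint and accuracy."""
    false_positive_rate = inv_logit(theta)
    true_positive_rate = inv_logit(theta + alpha)
    return true_positive_rate, 1.0 - false_positive_rate


def parameters_from_rates(sensitivity, specificity):
    """Invert the model: theta is the logit of the false positive rate,
    alpha is the log-odds ratio of a positive call, cancer vs. no cancer."""
    theta = logit(1.0 - specificity)
    alpha = logit(sensitivity) - theta
    return theta, alpha


theta, alpha = parameters_from_rates(0.8, 0.9)
print(theta, alpha)                   # alpha equals log(36) here
print(operating_point(theta, alpha))  # recovers (0.8, 0.9) up to rounding
```

For a sensitivity of 0.8 and specificity of 0.9 the odds of a positive call are 4 with cancer and 1/9 without, so the accuracy parameter is log(36). When an observed rate is exactly 1.0 the logit is undefined and `parameters_from_rates` raises `ZeroDivisionError`.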

The parameters $\theta_{ij}$ and $\alpha_{ij}$ could be calculated directly from the data. However, they are
not estimable when either sensitivity or specificity is 100%, a situation that is more likely when
a mammographer evaluates few films. The hierarchical model uses all available information
to better estimate these individual parameters. Under the hierarchical model, both cutpoint
parameters ($\theta_{ij}$) and accuracy parameters ($\alpha_{ij}$) are assumed to vary across mammographers and
data sources. We assume $\theta_{ij}$ and $\alpha_{ij}$ follow a bivariate normal distribution, implemented as:


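$$
\left.
\begin{aligned}
\theta_{i1} \mid \Theta_1, \sigma_{\theta 1} &\sim N(\Theta_1, \sigma^2_{\theta 1}) \\
\alpha_{i1} \mid A_1, \sigma_{\alpha 1} &\sim N(A_1, \sigma^2_{\alpha 1})
\end{aligned}
\right\} \text{conditionally independent}
$$

and

$$
\left.
\begin{aligned}
\theta_{i2} \mid (\theta_{11}, \ldots, \theta_{m1}), \Theta_2, \Gamma, \sigma_{\theta 2} &\sim N\Big(\Theta_2 + \Gamma\big(\theta_{i1} - \tfrac{1}{m}\textstyle\sum_{l=1}^{m} \theta_{l1}\big), \sigma^2_{\theta 2}\Big) \\
\alpha_{i2} \mid (\alpha_{11}, \ldots, \alpha_{m1}), A_2, \Lambda, \sigma_{\alpha 2} &\sim N\Big(A_2 + \Lambda\big(\alpha_{i1} - \tfrac{1}{m}\textstyle\sum_{l=1}^{m} \alpha_{l1}\big), \sigma^2_{\alpha 2}\Big)
\end{aligned}
\right\} \text{conditionally independent}
$$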


Thus, the model assumes that within each data set, mammographers’ cutpoint and accuracy
parameters are (conditionally) independent. The linear regression models for $\theta_{i2}$ and $\alpha_{i2}$ build in
correlation between cutpoint parameters and correlation between accuracy parameters, with:

$$\mathrm{corr}(\theta_{i1}, \theta_{i2}) = \rho_\theta$$
$$\mathrm{corr}(\alpha_{i1}, \alpha_{i2}) = \rho_\alpha$$


These correlation parameters are more informative than the between dataset correlation of sensitivity
or specificity. Correlation in sensitivity and specificity can be driven by mammographers’
overall tendency to provide positive calls. The correlation parameters $\rho_\theta$ and $\rho_\alpha$ separate the
overall tendency to call a film positive from the ability to distinguish between films from women
with and without breast cancer. Under this model, $\rho_\theta$ measures the correlation between cutpoint
parameters that are associated with the overall tendency to call a film ‘positive’ while
$\rho_\alpha$ measures association between accuracy parameters that are independent of these cutpoint
parameters.
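A short simulation shows how a regression of clinical parameters on centered test parameters induces a between-dataset correlation. This is a sketch with hypothetical names and arbitrary illustrative values, not the fitted estimates. The closed form in `implied_correlation` is derived here from the regression structure (treating the centering term as fixed) and is an assumption, not a formula quoted from the text.

```python
import math
import random


def simulate_parameters(m, mean_test, var_test, mean_clin, slope, var_clin, rng):
    """Draw test-set parameters, then clinical parameters regressed on them."""
    test = [rng.gauss(mean_test, math.sqrt(var_test)) for _ in range(m)]
    centre = sum(test) / m  # centering by the sample mean of test parameters
    clinical = [
        rng.gauss(mean_clin + slope * (t - centre), math.sqrt(var_clin))
        for t in test
    ]
    return test, clinical


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def implied_correlation(slope, var_test, var_clin):
    """Correlation implied by the regression: cov = slope * var_test and
    var(clinical) = slope**2 * var_test + var_clin."""
    return slope * math.sqrt(var_test) / math.sqrt(slope**2 * var_test + var_clin)


rng = random.Random(0)
test, clinical = simulate_parameters(
    m=100_000, mean_test=0.0, var_test=0.25,
    mean_clin=0.0, slope=0.5, var_clin=0.3, rng=rng,
)
print(pearson(test, clinical), implied_correlation(0.5, 0.25, 0.3))
```

A slope of zero gives zero correlation, and the sign of the correlation follows the sign of the slope, which is why the regression coefficients and the correlations carry the same qualitative information.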

Because the regression model is centered, the expected value of $\theta_{i2}$ is $\Theta_2$ and the expected
value of $\alpha_{i2}$ is $A_2$. Assuming that $\theta_{ij}$ and $\alpha_{ij}$ are normally distributed and linked via a regression
model allows fuller use of the available data, resulting in better estimation. Mammographers’
cutpoint and accuracy parameters are smoothed toward overall expected values $\Theta_j$ and $A_j$, with
the degree of smoothing determined by the amount of data each contributes to the model.
Estimates for mammographers with less data will tend to be nearer to expected values than
estimates for mammographers with more data, while corresponding interval estimates widen to
reflect lack of information available for these parameters.
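The smoothing behaviour can be illustrated with a much simpler conjugate stand-in. The sketch below uses a Beta-binomial model, not the logit-normal hierarchy described in the text, and the Beta(9, 1) prior is a hypothetical choice made only so that the prior mean is 0.9. The two sets of counts correspond to clinical sensitivities of 1/1 and 28/32.

```python
def shrunken_rate(successes, trials, prior_a, prior_b):
    """Posterior mean of a rate under a Beta(prior_a, prior_b) prior."""
    return (successes + prior_a) / (trials + prior_a + prior_b)


def data_weight(trials, prior_a, prior_b):
    """Weight on the observed rate; the prior mean gets the remainder."""
    return trials / (trials + prior_a + prior_b)


PRIOR_A, PRIOR_B = 9.0, 1.0  # hypothetical prior with mean 0.9

for detected, cancers in [(1, 1), (28, 32)]:
    observed = detected / cancers
    smoothed = shrunken_rate(detected, cancers, PRIOR_A, PRIOR_B)
    weight = data_weight(cancers, PRIOR_A, PRIOR_B)
    print(f"n={cancers:2d} observed={observed:.3f} "
          f"smoothed={smoothed:.3f} data weight={weight:.2f}")
```

The reader with a single cancer film is pulled almost all the way to the overall mean, while the reader with 32 cancer films keeps an estimate close to the observed rate. The hierarchical model behaves in the same qualitative way, although its numerical results will differ.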

The hierarchical model is completed by specifying prior distributions for the remaining unknown
parameters. Priors were chosen to cover the range of plausible values of parameters and
were selected to be uninformative. We used a Normal(0,10) prior for $\Theta_1$, $\Theta_2$, $A_1$, and $A_2$, and
a Normal(0,100) prior for $\Gamma$ and $\Lambda$. We used an inverse gamma, $\Gamma^{-1}(0.5, 2)$, for $\sigma^2_{\theta 1}$, $\sigma^2_{\theta 2}$, $\sigma^2_{\alpha 1}$
and $\sigma^2_{\alpha 2}$. This prior is diffuse, but does not overweight large values. Quartiles of the $\Gamma^{-1}(0.5, 2)$
distribution are 3.03, 8.80, and 39.41. The parameters $\Theta_1$, $\Theta_2$, $A_1$, $A_2$, $\Gamma$, $\Lambda$, $\sigma^2_{\theta 1}$, $\sigma^2_{\theta 2}$, $\sigma^2_{\alpha 1}$ and
$\sigma^2_{\alpha 2}$ are assumed to be mutually independent.
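The quoted quartiles can be checked without special software. The sketch below assumes that the inverse gamma prior is parameterized by shape 0.5 and scale 2; that reading is an assumption, chosen because it gives values near the quoted ones. With shape 0.5, the quantity 2 * scale / X is chi-square with one degree of freedom, so only normal quantiles are needed.

```python
from statistics import NormalDist


def inverse_gamma_half_quantile(p, scale):
    """p-quantile of an inverse gamma with shape 0.5 and the given scale.

    If X ~ InvGamma(0.5, scale), then 2 * scale / X is the square of a
    standard normal variable, so P(X <= x) = P(Z**2 >= 2 * scale / x).
    """
    # Z**2 must sit at its (1 - p) quantile, i.e. |Z| at Phi^{-1}(1 - p / 2).
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return 2.0 * scale / (z * z)


quartiles = [inverse_gamma_half_quantile(p, scale=2.0) for p in (0.25, 0.5, 0.75)]
print([round(q, 2) for q in quartiles])
```

Under this parameterization the computed quartiles fall within a few hundredths of 3.03, 8.80, and 39.41.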

This model was estimated using the BUGS program. [10] To improve estimation, the disease
state indicator $D_{ijk}$ was centered so that $D_{ijk} = 0.5$ for disease positive films and $D_{ijk} = -0.5$ for
disease negative films. This transformation does not affect the interpretation of the parameters
$\alpha_{ij}$ and $\theta_{ij}$. Standard model diagnostics were used to assess convergence of the sampler, as
described  in  the  CODA  manual. [11]  These  models  resulted  in  estimated  posterior  distributions 
for  the  model  parameters.  We  present  estimated  posterior  modes  and  95%  credible  intervals 
based  on  the  2.5%  and  97.5%  percentiles.  The  posterior  mode  was  estimated  by  the  posterior 
mean  for  approximately  symmetric  distributions,  and  by  the  posterior  median  for  skewed  posterior 
distributions. 
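The summaries described here can be computed directly from a list of posterior draws. This is a minimal sketch with a hypothetical function name. It returns both the mean and the median because the text does not state how symmetry was judged, so the choice between them is left to the analyst.

```python
from statistics import mean, median, quantiles


def posterior_summary(draws):
    """Mean, median, and the 2.5% and 97.5% percentiles of posterior draws."""
    cuts = quantiles(draws, n=40)  # 39 cut points, one every 2.5%
    return {
        "mean": mean(draws),
        "median": median(draws),
        "lower": cuts[0],    # 2.5th percentile
        "upper": cuts[-1],   # 97.5th percentile
    }


# Evenly spaced stand-in for sampler output, for illustration only.
summary = posterior_summary(list(range(1, 1001)))
print(summary)
```

Percentile intervals of this kind are equal-tailed, so for a skewed posterior the reported point estimate need not sit at the center of the interval.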

4.  Results 

There  was  wide  variability  in  the  amount  of  clinical  data  available  for  each  mammographer 
(Table  1).  The  27  mammographers  clinically  evaluated  an  average  of  1890  films  during  the  four 
year  period  (range  232  to  3818),  and  saw  an  average  of  15  mammograms  from  women  with 
breast  cancer  (range  1  to  32).  The  average  clinical  prevalence  rate  across  mammographers  was 
8  cancers  per  1,000  mammograms. 
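These summary figures can be recomputed from the clinical denominators in Table 1. The sketch below assumes that the N reported with clinical specificity counts mammograms from women without cancer and the N reported with clinical sensitivity counts mammograms from women with cancer. The prevalence computed here is pooled over all readers, which is one possible reading of the average rate.

```python
# Clinical denominators per mammographer (1-27), transcribed from Table 1.
no_cancer_n = [
    1715, 1492, 2341, 2129, 2818, 2221, 1733, 2045, 1742,
    1435, 3299, 230, 971, 675, 2546, 441, 3167, 1451,
    3786, 1276, 3186, 2585, 1643, 1726, 1151, 2169, 663,
]
cancer_n = [
    14, 14, 14, 15, 25, 17, 19, 12, 23,
    12, 31, 2, 10, 2, 22, 1, 25, 11,
    32, 7, 25, 19, 13, 10, 4, 19, 6,
]

totals = [a + b for a, b in zip(no_cancer_n, cancer_n)]
mean_films = sum(totals) / len(totals)
mean_cancers = sum(cancer_n) / len(cancer_n)
pooled_per_1000 = 1000 * sum(cancer_n) / sum(totals)

print(f"films: mean {mean_films:.0f}, range {min(totals)} to {max(totals)}")
print(f"cancers: mean {mean_cancers:.0f}, range {min(cancer_n)} to {max(cancer_n)}")
print(f"pooled prevalence per 1,000: {pooled_per_1000:.1f}")
```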

Plots  of  the  sensitivity  and  specificity  suggest  moderate  positive  correlation  between  clinical 
and  test  performance.  Figure  1  shows  that  overall,  mammographers  tended  to  be  both  more 
sensitive  and  more  specific  in  clinical  practice.  The  observed  correlation  between  clinical  and  test 
sensitivities  was  -0.096;  correlation  between  specificities  was  0.446. 
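The observed correlations can be recomputed from the rates in Table 1. This sketch uses an ordinary Pearson correlation across the 27 mammographers, with names chosen here for illustration. Because the table values are rounded to three decimals, the results should land close to, but not necessarily exactly on, the reported figures.

```python
import math

# (test specificity, test sensitivity, clinical specificity, clinical
# sensitivity), one row per mammographer, transcribed from Table 1.
TABLE_1_RATES = [
    (0.880, 0.897, 0.922, 1.000), (0.687, 0.833, 0.816, 1.000),
    (0.687, 0.833, 0.804, 0.929), (0.880, 0.833, 0.823, 0.933),
    (0.867, 0.800, 0.896, 0.880), (0.756, 0.733, 0.917, 0.941),
    (0.867, 0.767, 0.965, 0.684), (0.904, 0.700, 0.911, 0.917),
    (0.867, 0.833, 0.879, 0.826), (0.831, 0.800, 0.832, 0.833),
    (0.867, 0.800, 0.915, 0.935), (0.831, 0.724, 0.865, 1.000),
    (0.867, 0.833, 0.870, 0.800), (0.904, 0.867, 0.877, 0.500),
    (0.783, 0.833, 0.881, 0.955), (0.880, 0.800, 0.930, 1.000),
    (0.855, 0.867, 0.883, 0.960), (0.854, 0.867, 0.822, 1.000),
    (0.771, 0.833, 0.901, 0.875), (0.904, 0.767, 0.905, 0.714),
    (0.855, 0.833, 0.908, 0.800), (0.904, 0.733, 0.880, 0.947),
    (0.855, 0.793, 0.943, 0.846), (0.807, 0.828, 0.913, 1.000),
    (0.819, 0.767, 0.864, 1.000), (0.892, 0.900, 0.920, 0.842),
    (0.759, 0.833, 0.867, 0.833),
]


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


test_spec, test_sens, clin_spec, clin_sens = zip(*TABLE_1_RATES)
r_sensitivity = pearson(test_sens, clin_sens)
r_specificity = pearson(test_spec, clin_spec)
print(f"sensitivity r = {r_sensitivity:.3f}, specificity r = {r_specificity:.3f}")
```

This calculation weights every mammographer equally, so a reader with one or two clinical cancers influences the sensitivity correlation as much as a reader with thirty.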
[ Table 1 about here ]

[ Figure 1 about here ]

The hierarchical model accounts for within mammographer variability in sensitivity and specificity
and accounts for differences in the number of films read in clinical practice. The model can
be used to better estimate each mammographer’s clinical and test-based sensitivity and specificity,
and thus to better estimate between dataset correlation in sensitivity and specificity. Model
based estimates of sensitivity and specificity combine information from the entire sample with each
mammographer’s information. The degree to which estimates differ from observed values reflects
the amount of data available, the values of other parameter estimates (i.e., $\hat{\Theta}_1$, $\hat{\Theta}_2$, $\hat{A}_1$, $\hat{A}_2$, $\hat{\Gamma}$ and
$\hat{\Lambda}$) and underlying distributional assumptions. Estimates of clinical specificity were equal to model
estimates  because  these  were  based  on  large  numbers  of  films.  In  contrast,  estimates  of  clinical 
sensitivity  were  more  strongly  influenced  by  additional  information,  especially  for  mammographers 
who  evaluated  very  few  films.  Model-based  estimates  of  between  dataset  correlation  of  sensitivity 
and  specificity  were  similar  to  observed  correlation  estimates.  Correlation  between  clinical  and 
test  sensitivity  was  0.185  with  95%  credible  interval  (-0.269,0.593).  Correlation  between  clinical 
and  test  specificity  was  0.408  with  95%  credible  interval  (0.161,0.616). 

We  found  little  evidence  of  correlation  between  clinical  and  test  performance  parameters 
(Table 2). Our point estimate of correlation between clinical and test cutpoints was moderate
($\hat{\rho}_\theta = 0.220$) although the 95% credible interval was broad and covered zero. The estimated
probability that $\rho_\theta > 0$ was 89.3%. Our point estimate of the correlation between clinical and
test accuracies was near zero ($\hat{\rho}_\alpha = -0.026$).
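A posterior probability such as the one quoted for the cutpoint correlation is commonly estimated as the fraction of posterior draws that exceed the threshold. The sketch below shows that calculation with a hypothetical function name and made-up draws. Whether the authors computed their figure this way is an assumption.

```python
def posterior_probability_above(draws, threshold=0.0):
    """Fraction of posterior draws strictly greater than the threshold."""
    draws = list(draws)
    return sum(d > threshold for d in draws) / len(draws)


# Made-up draws of a correlation parameter, for illustration only.
example_draws = [-0.1, 0.2, 0.3, 0.4]
print(posterior_probability_above(example_draws))
```

A probability near 90% alongside a 95% interval that covers zero is not contradictory: the interval excludes 2.5% of the posterior mass in each tail, while this probability counts all mass above zero.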

We  found  expected  overall  differences  in  test  and  clinical  accuracy.  The  test  film  set  was 
constructed  to  be  more  difficult  than  films  seen  in  usual  clinical  practice,  and  as  expected  the 
estimated  mean  clinical  accuracy  parameter  (A2)  was  greater  than  the  estimated  mean  test 
accuracy  parameter  (Ai),  indicating  that  overall  readers  were  more  accurate  when  evaluating 
clinical  data  than  test  data. 
Point estimates also demonstrated that mammographers had an overall tendency to give more
positive assessments in their clinical practice than in the test setting (mode $\Theta_2$ > mode $\Theta_1$),
even  though  the  prevalence  of  breast  cancer  was  much  higher  in  the  test  setting. 

Estimated between mammographer variability tended to be higher in clinical practice than in
the test setting (e.g., $\sigma^2_{\theta 2} > \sigma^2_{\theta 1}$ and $\sigma^2_{\alpha 2} > \sigma^2_{\alpha 1}$), possibly reflecting wider variability in the films
each mammographer reads in clinical practice, or the relatively small number of cancer films each
mammographer evaluated over the course of four years in clinical practice.


[ Table 2 about here ]


5.  Discussion 

These  results  represent  a  comprehensive  comparison  of  mammographers’  assessments  in  test 
and  clinical  settings.  The  clinical  data  was  based  on  automated  collection  of  mammographers’ 
interpretations  and  recommendations.  The  data  systems  also  allowed  two  year  follow-up  of  each 
woman  screened.  The  test  data  included  a  relatively  large  set  of  113  mammograms  and  included 
30  cancers.  Finally,  our  statistical  model  allowed  for  differences  in  the  number  of  films  each 
mammographer  assessed  during  clinical  practice. 

There was general agreement between observed values and hierarchical model results. Mammographers
tended to be less accurate when evaluating the more difficult test film set, and
tended  to  give  more  positive  assessments  in  their  clinical  practice.  Thus,  we  found  no  evidence 
of  context  bias  as  described  by  Egglin[7].  That  is,  mammographers  did  not  tend  to  make  more 
positive  assessments  in  the  higher  prevalence  test  film  set.  However,  we  cannot  conclude  from 
this  study  that  context  bias  does  not  exist,  because  the  test  context  included  both  a  higher  disease 
prevalence  and  a  more  difficult  set  of  films. 

Model-based  estimates  of  between  dataset  correlation  of  sensitivity  were  stronger  than  ob¬ 
served  correlation,  and  the  estimated  between  dataset  correlation  of  specificity  was  statistically 
different  from  zero.  However,  between  dataset  correlation  of  sensitivity  and  specificity  appears  to 
be driven by correlation in the mammographers’ tendency to call tests positive rather than correlation
in their accuracy evaluating the two data sets. We found moderate, but not statistically
significant, correlation between mammographers’ overall tendency to identify cancer using
the  two  data  sources.  But  there  was  no  apparent  correlation  between  the  hierarchical  model’s 
accuracy  parameters. 

We  do  not  believe  that  the  lack  of  correlation  between  clinical  and  test  performance  resulted 
from  differences  in  outcome  scales.  The  basic  assumption  that  we  are  testing  is  that  these  two 
measures are correlated because both are measures of the same underlying construct, mammographer
accuracy. We are not interested in the equality of these two measures; we expect
these  accuracy  estimates  to  differ  because  of  differences  in  film  difficulty,  film  quality,  and  the 
information  provided  to  mammographers. 

We  do  not  believe  that  the  lack  of  correlation  between  clinical  and  test  performance  resulted 
from  dichotomizing  the  outcome  scales.  We  did  not  attempt  to  model  the  ordinal  outcomes 
directly  or  via  the  area  under  the  receiver  operating  characteristic  (ROC)  curve  because  in  both 
clinical  and  test  settings  mammographers'  maximum  false  positive  rates  were  relatively  low. 
Because of this, the areas under their ROC curves were strongly influenced by false positive
rates that were outside of the observed data range, especially for clinical data. The sensitivity
and  specificity  pairs  we  used  in  analyses  contained  most  of  the  information  available  from  ROC 
curves. 
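A rough calculation shows why a low maximum false positive rate makes the area under the curve depend on extrapolation. The sketch below is a simplification: it uses a piecewise-linear ROC curve through a single operating point rather than a fitted ordinal model, and the operating point (false positive rate 0.15, true positive rate 0.83) is an illustrative value, not an estimate from the study.

```python
def roc_area_split(fpr, tpr):
    """Split the area under the piecewise-linear ROC through (0, 0),
    (fpr, tpr) and (1, 1) into the part up to the observed false positive
    rate and the part beyond it."""
    observed = 0.5 * fpr * tpr                    # triangle from 0 to fpr
    extrapolated = 0.5 * (1.0 - fpr) * (tpr + 1)  # trapezoid from fpr to 1
    return observed, extrapolated


observed, extrapolated = roc_area_split(fpr=0.15, tpr=0.83)
total = observed + extrapolated
print(f"area = {total:.3f}, share beyond observed range = {extrapolated / total:.1%}")
```

In this simplified setting more than nine tenths of the area comes from false positive rates at which no assessments were observed, so the area mostly reflects the assumed shape of the curve rather than the data.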

There  are  many  possible  explanations  for  the  lack  of  correlation  in  these  data.  Our  ‘gold 
standard'  for  true  disease  state  was  based  on  a  two-year  follow-up  interval,  and  misclassification 
of  diseased  and  not  diseased  women  may  have  attenuated  observed  correlation.  Our  sample  of  27 
mammographers  may  have  been  too  small  to  detect  statistically  significant  correlation,  although 
point  estimates  suggest  there  was  not  clinically  relevant  correlation  in  accuracies.  Examining 
mammographers  practicing  within  the  same  HMO  may  have  reduced  variability  so  that  correlation 
was  not  observable.  Many  of  the  mammographers  in  this  study  worked  together  and  discussed 
difficult  cases  with  each  other  on  a  day-to-day  basis.  Finally,  lack  of  correlation  may  have 
resulted  from  differences  in  the  type  of  films  included  in  the  two  data  sets.  Clinical  data  included 
assessments  of  exams  based  on  imaging  studies,  such  as  ultrasound  and  magnification  views.  If 
evaluation of 2 view mammograms requires different skills than evaluation of additional work-up
images, then the inclusion of these films in the clinical set could attenuate the estimated
correlation  between  clinical  and  test  accuracy.  However,  excluding  these  films  would  drastically 
reduce  the  number  of  cancer  cases  included  in  the  clinical  set  and  could  bias  comparisons  by 
reducing  the  clinical  data  set  to  films  that  the  original  reader  was  able  to  assess  without  additional 
work-up. Because these results were unexpected, we must consider possible explanations.
However,  these  explanations  are  ultimately  conjecture. 


The  apparent  lack  of  correlation  between  test  and  clinical  assessments  could  be  interpreted 
in  at  least  two  ways.  One  interpretation  is  that  evaluations  based  on  clinical  assessments  and 
evaluations  based  on  test  film  sets  are  measuring  two  different  kinds  of  accuracy.  Because  we 
are  interested  in  clinical  performance,  concluding  that  test-based  assessments  of  accuracy  are 
different  from  clinical  accuracy  means  either  throwing  out  the  test  data  sets  as  a  reasonable 
means of mammographer evaluation, or seeking out ways to make test evaluations more comparable
to clinical evaluations. A second interpretation is that the apparent lack of correlation
between clinical and test performance resulted from differences in the clinical case mix of participating
mammographers. Clinical data included assessments based on both standard screening
mammograms  and  screening  mammograms  that  included  additional  work  up,  such  as  ultrasound 
and  magnification  views.  We  do  not  know  how  these  different  types  of  films  were  distributed 
across mammographers, or whether there were any informal systems of referral at the mammography
centers. Systematic differences between mammographers could also have been introduced
through  differences  in  screened  populations,  for  example,  differences  in  the  average  age  of  women 
screened.  Concluding  that  the  clinical  data  are  problematic  means  either  throwing  out  the  clinical 
data as a means of mammographer evaluation, or seeking out ways to make the clinical evaluations
more comparable across mammographers. Unfortunately, our analyses cannot guide our
conclusions  about  clinical  and  test  data,  though  they  caution  us  against  extrapolating  results 
from  one  setting  into  another. 

Table 1. Mammographic Assessments of 27 Mammographers: Rate of
positive assessments, indicating disease, with the total number of assessments in
parentheses.

                    Test Data                             Clinical Data
mammographer    specificity (N)   sensitivity (N)     specificity (N)   sensitivity (N)
 1              0.880 (83)        0.897 (29)          0.922 (1715)      1.000 (14)
 2              0.687 (83)        0.833 (30)          0.816 (1492)      1.000 (14)
 3              0.687 (83)        0.833 (30)          0.804 (2341)      0.929 (14)
 4              0.880 (83)        0.833 (30)          0.823 (2129)      0.933 (15)
 5              0.867 (83)        0.800 (30)          0.896 (2818)      0.880 (25)
 6              0.756 (82)        0.733 (30)          0.917 (2221)      0.941 (17)
 7              0.867 (83)        0.767 (30)          0.965 (1733)      0.684 (19)
 8              0.904 (83)        0.700 (30)          0.911 (2045)      0.917 (12)
 9              0.867 (83)        0.833 (30)          0.879 (1742)      0.826 (23)
10              0.831 (83)        0.800 (30)          0.832 (1435)      0.833 (12)
11              0.867 (83)        0.800 (30)          0.915 (3299)      0.935 (31)
12              0.831 (83)        0.724 (29)          0.865 (230)       1.000 (2)
13              0.867 (83)        0.833 (30)          0.870 (971)       0.800 (10)
14              0.904 (83)        0.867 (30)          0.877 (675)       0.500 (2)
15              0.783 (83)        0.833 (30)          0.881 (2546)      0.955 (22)
16              0.880 (83)        0.800 (30)          0.930 (441)       1.000 (1)
17              0.855 (83)        0.867 (30)          0.883 (3167)      0.960 (25)
18              0.854 (82)        0.867 (30)          0.822 (1451)      1.000 (11)
19              0.771 (83)        0.833 (30)          0.901 (3786)      0.875 (32)
20              0.904 (83)        0.767 (30)          0.905 (1276)      0.714 (7)
21              0.855 (83)        0.833 (30)          0.908 (3186)      0.800 (25)
22              0.904 (83)        0.733 (30)          0.880 (2585)      0.947 (19)
23              0.855 (83)        0.793 (29)          0.943 (1643)      0.846 (13)
24              0.807 (83)        0.828 (29)          0.913 (1726)      1.000 (10)
25              0.819 (83)        0.767 (30)          0.864 (1151)      1.000 (4)
26              0.892 (83)        0.900 (30)          0.920 (2169)      0.842 (19)
27              0.759 (83)        0.833 (30)          0.867 (663)       0.833 (6)

Table 2. Hierarchical model estimates from the posterior distribution.

parameter and description                                                         mode      95% credible region
$\Theta_1$: expected cutpoint parameter, test data                                -0.100    (-0.330, 0.125)
$A_1$: expected accuracy parameter, test data                                      3.322    (2.888, 3.553)
$\Theta_2$: expected cutpoint parameter, clinical data                             0.066    (-0.216, 0.352)
$A_2$: expected accuracy parameter, clinical data                                  4.361    (3.928, 4.798)
$\sigma^2_{\theta 1}$: between-mammographer variance of cutpoints, test data       0.261    (0.153, 0.489)
$\sigma^2_{\alpha 1}$: between-mammographer variance of accuracy, test data        0.409    (0.215, 0.823)
$\sigma^2_{\theta 2}$: between-mammographer variance of cutpoints, clinical data   0.337    (0.190, 0.658)
$\sigma^2_{\alpha 2}$: between-mammographer variance of accuracy, clinical data    0.502    (0.247, 1.096)
$\Gamma$: regression coefficient, cutpoint parameters                              0.560    (-0.341, 1.530)
$\Lambda$: regression coefficient, accuracy parameters                            -0.048    (-1.020, 0.945)
$\rho_\theta$: correlation between clinical and test cutpoints                     0.220    (-0.142, 0.486)
$\rho_\alpha$: correlation between clinical and test accuracy                     -0.026    (-0.477, 0.446)

Figure 1A. Sensitivity in clinical practice versus sensitivity in a test setting for
27 mammographers.

ABSTRACT. This paper is concerned with the design and analysis of mammography reading studies. In particular
we consider studies aimed at evaluating interventions to improve the accuracy with which mammograms
are read. A simple randomized design is suggested in which a relatively large group of readers read sets of mammograms
before and after an intervention phase. We propose solutions to three difficult statistical issues that arise
in the context of such studies: (i) the choice of primary outcome measure; (ii) the data analysis technique to
be employed; and (iii) the methodology for calculating sample sizes for readers and images to be read.

First,  we  argue  in  favor  of  using  sensitivity  and  specificity  as  the  primary  outcome  measures  rather  than  receiver 
operating  characteristic  (ROC)  curves  in  mammography  studies,  although  the  latter  are  considered  state  of  the 
art  for  many  types  of  radiology  reading  studies.  We  argue  that  sensitivity  and  specificity  are  more  clinically 
relevant  and  conceptually  more  straightforward  than  ROC  curves.  Second,  we  suggest  a  bivariate  approach  to 
data  analysis  for  evaluating  intervention  effects  on  sensitivity  and  specificity.  This  accommodates  the  correlations 
inherent  between  these  measures  and  allows  for  estimation  of  joint  effects  on  them.  Finally  we  propose  a  method 
for  power  calculations  that  uses  computer  simulation  techniques.  Simple  formulas  for  sample  size  calculations 
are  not  available  in  part  because  variability  in  accuracy  amongst  readers  and  variation  in  difficulty  among  images 
introduce  complexity  into  power  calculations.  The  simulation  method  that  we  propose  accommodates  such 
complexity  and  is  easy  to  implement. 

The methodology was motivated by a study funded by the Department of Defense to evaluate the potential
efficacy of an educational intervention. In the context of this study we illustrate the steps involved
in power calculations and apply the data analytic techniques to the sort of data expected to result
from this study. Though the proposed methods were motivated by this particular study, the statistical
considerations are relevant more broadly in mammography and indeed in other types of radiologic imaging
studies. Standards for the conduct of radiologic reading studies are not yet well developed, as they
are for randomized clinical trials and for case-control studies. We hope that the discussion in this
paper will add to the dialogue necessary for development of such standards.
J Clin Epidemiol 50;12:1327-1338, 1997. © 1997 Elsevier Science Inc.

KEY  WORDS.  ROC  curves,  sensitivity  and  specificity,  computer  simulation,  diagnostic  tests,  screening 


1.  INTRODUCTION 

Mammography  screening  for  breast  cancer  has  been  shown 
to  be  associated  with  decreased  breast  cancer  mortality,  at 
least  in  women  over  the  age  of  50  years  [1].  Major  efforts 
are  currently  underway  to  improve  participation  by  women 
in  screening  programs  [2].  Nevertheless,  there  is  concern 
about  the  quality  of  mammography  screening  and  there  is 
general  agreement  that  improvements  in  quality  may  lead 
to  improvements  in  the  performance  of  mammography  as 
a screening modality. Quality might be improved, for example, by improving
the imaging procedures. Alternatively, improvements in the accuracy with
which mammographers interpret mammograms may improve the performance of
screening mammography. Recent studies [3,4] have shown that there is
considerable variability amongst radiologists in their interpretations of
screening mammograms. Elmore et al. [3] observed that sensitivities ranged
from 74% to 96% and that specificities ranged from 35% to 89% among 10
radiologists reading 150 selected mammograms. Beam et al. [4], using a much
larger sample of 108 radiologists, each reading 79 mammograms, found
sensitivities in the range of 47-100% and specificities in the range of
35-99%. These observations suggest that improvement in interpretation may be
possible.

As part of a project called the Mammography Quality Improvement Project
(MQIP), funded by the Department of Defense and aimed at improving the
quality of mammography screening in rural communities, we are developing an
educational program to improve the accuracy with which radiologists interpret
mammograms. The educational intervention is composed of a series of five
sessions in which mammographers read films and are provided with immediate
feedback on the accuracy of their interpretations. Feedback is provided using
a laptop personal computer that is mailed to the radiologist prior to his
reading session. The computer program emphasizes the particular features of
each mammogram that are relevant to determining the disease status of the
woman screened. Eventually it may be possible to disseminate this sort of
intervention over computer networks, thus making it attractive in terms of
easy accessibility and low cost.

To evaluate the impact of such an intervention on improvements in diagnostic
accuracy it will eventually be necessary to perform a study of radiologists'
interpretations of screening mammograms in their actual practices. As a
preliminary step to such a large-scale study, we will evaluate the
intervention effects in a more controlled setting. Specifically, we will have
a number of radiologists read a selected set of mammograms before and after
the intervention and evaluate changes in accuracy. The mammograms included in
this controlled study will be composed of about 50% from women with disease,
a proportion much larger than would be observed in practice but necessarily
high to estimate sensitivity rates in a small-scale study. Mammograms will be
selected to represent a reasonably broad range of interpretive difficulty.

The purpose of this paper is to elucidate some of the key statistical issues
in the design of such a controlled reading study. Standards for the design of
such studies are not well developed. This contrasts with therapeutic clinical
trials and epidemiologic studies where the basic elements of study design are
now fairly well standardized [5]. The question we propose to address in this
reading study, namely evaluation of an intervention effect in a controlled
setting, is a standard sort of question addressed in diagnostic imaging
research. Hence the design issues which are dealt with here will have
implications for future studies in mammography and in other diagnostic test
settings. These same issues also arise in reading studies designed to compare
different imaging modalities. The key issues concern the choice of relevant
primary outcome measures, appropriate data analysis strategies, and
methodology for power calculations that incorporates variability among
radiologists and among images. Broader issues in regards to study designs for
evaluating imaging tests have been discussed in a more general sense in the
literature [6,7].

In Section 2, we consider two sets of measures that can be used to define
accuracy in reading mammograms: first, sensitivity and specificity and
second, ROC curves. We argue in favor of the former, in part because they are
more clinically relevant and most easily understood, but also because the
latter can provide inappropriate conclusions concerning intervention
benefits. In Section 3, we detail the basic elements of the statistical
design of our study that could be considered a prototype for evaluating
intervention effects in diagnostic radiology. An approach to joint analysis
of sensitivity and specificity is outlined in Section 4. In Section 5, we
describe methodology for power calculations that are appropriate for the
proposed design and analysis. We propose the use of computer simulation
methods for calculating power because they allow for complex designs and can
easily incorporate variability amongst radiologists and images. Having
described the steps involved in calculating power in Section 5, we then apply
these procedures to the proposed MQIP study in Section 6, in order to
illustrate the methods. Concluding remarks follow in Section 7.

2.  MEASURES  OF  ACCURACY 
2.1  Definitions 

A radiologist reading a set of mammograms for a woman in our study will
classify each breast according to his or her suspicion of its showing
malignancy. The ACR lexicon for rating a breast [8], which we will employ,
defines a 5-point scale with category 1 indicating “normal, routine follow-up
recommended,” 2 indicating “benign, routine follow-up,” 3 indicating
“probably benign, early recall recommended,” 4 indicating “suspicious for
cancer, consider biopsy,” and 5 indicating “highly suspicious for cancer,
biopsy recommended.” A common definition of a screen positive mammogram is
one that receives a rating of 4 or greater. These are mammograms that are
sufficiently suspicious for cancer that biopsy is recommended and hence they
have an impact on clinical practice. Sometimes a rating of 3 or greater is
considered positive. Because of the clinical implications of ratings 4 and 5,
we will focus on the positivity criterion of category ≥4 here.

Given a definition for screen positivity, since there is a rating for each
breast, one can calculate sensitivities and specificities with either “woman”
or “breast” as the unit of analysis. The latter includes all non-diseased
breasts (including non-diseased breasts from women with cancer) as the
denominator for specificity and all diseased breasts as the denominator for
sensitivity. However, since the consequences of false positive and false
negative errors relate to the woman (rather than the breast), it seems more
clinically relevant to use woman rather than breast as the unit of analysis.
Thus, for example, we count the proportion of women with disease who have it
detected as the sensitivity, rather than defining the sensitivity to be the
proportion of diseased breasts which are detected. This accords with previous
literature [3]. One could use the maximum of the ratings for the left and
right sides as the woman level rating for calculation of sensitivity and
specificity. Occasionally, however, a woman with unilateral disease may not
have it detected in the affected side but will have a positive mammogram on
the unaffected side. In this case, using the maximum rating will
inappropriately inflate the sensitivity. We define sensitivity instead as the
proportion of women with disease who have it detected (a rating of ≥4) on the
affected side. The specificity is the proportion of women without disease who
have a maximum rating of less than 4.
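
The woman-level definitions above can be illustrated with a minimal Python sketch. The data layout (a dict per woman with `left` and `right` ratings and a list of `affected` sides) and the function name are hypothetical choices made for illustration, not part of the study. Counting a woman with bilateral disease as detected when any affected side is rated at or above the threshold is likewise an assumption, since the text addresses only unilateral disease.

```python
def woman_level_accuracy(cases, threshold=4):
    """Return (sensitivity, specificity) with the woman as the unit of analysis.

    Each case is a dict with ratings 'left' and 'right' (1-5) and 'affected',
    which is None for a woman without disease or a list of affected sides.
    (Hypothetical data layout, chosen for this illustration only.)
    """
    n_diseased = n_detected = n_healthy = n_negative = 0
    for case in cases:
        if case["affected"]:
            n_diseased += 1
            # Detected only if a rating >= threshold occurs on an affected side.
            if any(case[side] >= threshold for side in case["affected"]):
                n_detected += 1
        else:
            n_healthy += 1
            # Screen negative only if the maximum of both ratings is below threshold.
            if max(case["left"], case["right"]) < threshold:
                n_negative += 1
    sensitivity = n_detected / n_diseased if n_diseased else float("nan")
    specificity = n_negative / n_healthy if n_healthy else float("nan")
    return sensitivity, specificity


example_cases = [
    {"left": 5, "right": 1, "affected": ["left"]},  # detected on the affected side
    {"left": 2, "right": 4, "affected": ["left"]},  # positive only on the unaffected side
    {"left": 1, "right": 2, "affected": None},      # true negative
    {"left": 4, "right": 1, "affected": None},      # false positive
]
print(woman_level_accuracy(example_cases))
```

In this toy example the second woman is not counted as detected, whereas a rule based on the maximum rating would have counted her and so inflated the sensitivity.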

ROC analysis is a statistical technique used to describe accuracy of
diagnostic tests when the test outcome is either ordinal or continuous as
opposed to binary. The rating data generated in radiology reading studies are
ordinal and ROC analysis is often considered optimal for the analysis of such
studies as is evidenced, for example, in a recent issue of Academic Radiology
[9]. An ROC curve is constructed by varying the criterion used for defining a
positive mammogram from “rating ≥2” to “rating ≥5,” plotting the associated
sensitivity and 1-specificity values against each other, and finally fitting
a curve to the points so that the curve is anchored at (0,0) and (1,1).
Various algorithms exist for fitting a curve, the most notable being the
Dorfman-Alf algorithm based on the binormal model [10] and the empirical
nonparametric method that simply connects observed ROC points linearly. The
area under the ROC curve is usually used to summarize accuracy. Again we
suggest that woman rather than breast should be the unit of analysis in
defining the ROC curve. That is, in calculating the sensitivity corresponding
to the criterion “rating ≥ K,” it should be defined as the proportion of
women with cancer who have a rating of ≥ K on an affected side.

2.2  ROC  Analysis  Versus  Sensitivity  and  Specificity 

ROC analysis was developed originally for diagnostic tests with results on
some arbitrary scale. Its primary advantage is that it allows one to assess
the inherent capacity of the test to distinguish between diseased and
non-diseased subjects without linking the test to some particular threshold
for defining screen positive [11,12]. This seems appropriate in radiology
experiments when image ratings are arbitrary numbers with no specific
clinical meaning attached to them. In that case, shifts in the distributions
of ratings are of no consequence as long as they are equally shifted for
diseased and non-diseased subjects. In mammography, however, mammogram
ratings have very specific clinical meanings and consequent clinical
implications. Uniform shifts in the frequencies with which rating categories
are chosen can have major clinical implications.

FIGURE 1. A hypothetical setting where the sensitivity and specificity
associated with the clinically relevant criteria are unchanged but the
empirical ROC curves indicate a benefit of intervention. The (false positive,
true positive) points associated with categories 5, 4, 3, and 2 are
(0.10, 0.30), (0.25, 0.70), (0.45, 0.85), and (0.75, 0.95), respectively,
pre-intervention; and (0.10, 0.60), (0.25, 0.70), (0.45, 0.85), and
(0.55, 0.95), respectively, post-intervention.

Moreover, in contrast to the prototype setting for ROC analysis, shifts
between certain diagnostic categories are of more importance than others. For
example, as noted by Kopans [13], whether an image is rated in category 4
versus category 5 has no clinical impact. Similarly, classifications in
category 1 versus category 2 are clinically irrelevant. However, shifts
between categories 4 and 5 and between categories 1 and 2 can have a big
impact on the ROC analysis. To illustrate this, consider the setting shown in
Fig. 1. The effect of intervention in this setting is to shift
classifications of diseased observations from category 4 to category 5 and
classification of non-diseased patients from category 2 to category 1. Though
these changes are of no clinical import, the ROC type analysis indicates a
benefit for the intervention. Thus an ROC analysis can indicate a benefit of
intervention even though a clinically relevant benefit does not exist.
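
The Fig. 1 argument can be checked numerically. The sketch below computes the empirical (trapezoidal) area under the ROC curve from the (false positive, true positive) points listed in the Fig. 1 caption, anchoring the curve at (0,0) and (1,1). The function name is an illustrative choice. The trapezoidal rule corresponds to the empirical method of connecting observed points linearly, not to a binormal fit.

```python
def empirical_auc(points):
    """Trapezoidal area under an ROC curve anchored at (0, 0) and (1, 1).

    `points` is a list of (false positive rate, true positive rate) pairs.
    """
    pts = sorted([(0.0, 0.0)] + list(points) + [(1.0, 1.0)])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Points for categories 5, 4, 3 and 2, taken from the Fig. 1 caption.
pre = [(0.10, 0.30), (0.25, 0.70), (0.45, 0.85), (0.75, 0.95)]
post = [(0.10, 0.60), (0.25, 0.70), (0.45, 0.85), (0.55, 0.95)]

auc_pre = empirical_auc(pre)
auc_post = empirical_auc(post)
print(round(auc_pre, 3), round(auc_post, 3))
# The operating point for the criterion "rating >= 4" is identical in both:
print(pre[1], post[1])
```

The empirical area rises from about 0.759 to about 0.811, although the point for the clinically relevant criterion, (0.25, 0.70), does not move.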

Of even more concern is the fact that a clinically relevant benefit of
intervention can occur even when the ROC curves pre- and post-intervention
are the same. Consider the ROC curve depicted in Fig. 2 for such a situation.
The location on the ROC curve of the points associated with the criterion
“rating ≥ category 4” indicates that sensitivity was significantly increased
without decreasing specificity. This clinically relevant improvement in test
accuracy does not manifest itself in an improvement in the ROC curves since
the pre- and post-intervention curves are the same. (Interestingly, classic
binormal ROC curves do not fit the situation depicted in Fig. 2 and a
binormal ROC analysis in this setting may incorrectly indicate that the ROC
curve post-intervention is improved over that pre-intervention.)

FIGURE 2. A hypothetical setting where the ROC curve is unchanged by the
intervention but there is a clinically relevant benefit. The sensitivity
associated with the clinically relevant criterion is improved from 0.50 to
0.70 while the associated false positive rate remains unchanged at 0.09. The
(false positive, true positive) points associated with categories 5, 4, 3,
and 2 are (0.03, 0.27), (0.09, 0.50), (0.15, 0.83), and (0.39, 0.93)
pre-intervention and (0.03, 0.27), (0.09, 0.70), (0.15, 0.83), and
(0.39, 0.93) post-intervention. These points before intervention are labeled
with circles and after intervention are labeled with triangles.

The fact that ROC analysis can yield inappropriate conclusions regarding the
clinically relevant effects of intervention argues against its use for the
primary analysis of mammography reading study data. Another valid argument
for not using an ROC analysis is that it is complicated and not easily
understood by clinicians. Moreover, the so-called “area under the curve” that
summarizes the ROC curve in a single number has an interpretation that is not
well known or easily understood. It can be interpreted as the probability
that a radiologist will have a greater suspicion of cancer from a mammogram
from a woman with disease than from a woman without [14]. This probability,
however, seems to be of more theoretical than practical relevance.

We propose using the more clinically meaningful quantities of sensitivity and
specificity for the primary data analysis and employing ROC analysis as a
secondary descriptive device. Though ROC analysis may be statistically more
powerful in some settings, statistical power is of secondary importance
relative to clinical relevance. Any study should be designed so that it has
adequate power to detect changes in the quantities that are of practical
relevance. Hence, we suggest that power calculations for a mammography
reading study should be based on the ability to detect changes in sensitivity
and specificity rather than on the basis of detecting changes in ROC curves.

3.  STUDY  DESIGN 

We now describe the basic elements of the design that we propose for studies
evaluating intervention effects on reading accuracy in mammography. In this
prototype design, radiologists are randomly assigned to intervention and
control groups, with the number in the former being denoted by $R_T$ and the
number in the latter denoted by $R_C$. Two image sets are constructed with
$M$ images in each set $S = 1, 2$. In set $S$, a number $M_D^S$ are from
women with disease and this number may differ between the two sets. Each
reader reads one set of images before the intervention period and one set
after. It is important that the sets before and after intervention be
different since readers may remember, to some degree, images that they have
previously read. Half of the readers, chosen at random in each of the
intervention and control groups, read set 1 before intervention and set 2
after intervention. The other half read them in the opposite order: set 2
followed by set 1. This cross-over of film sets eliminates the possibility of
systematic bias due to film sets. The design is balanced in the sense that
set 1 is read equally often before and after the intervention phase in both
the intervention and control groups, and similarly for set 2. Readers are
told the approximate prevalence of diseased images, i.e.,
$(M_D^1 + M_D^2)/2M$, and that this varies between the two sets. The
rationale for telling the readers the approximate prevalence is that it will
become apparent in any case after reading the first set of images and that a
priori knowledge of it should reduce, as much as possible, its potential
impact on the observed improvement in accuracy. Readers will use the ACR
lexicon to classify mammograms and for each reading it will be determined if
it is screen positive or negative according to whether the rating is at least
4 or less than 4.
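
The allocation scheme described above can be sketched as follows. This is an illustrative Python sketch only. It assumes equal-sized intervention and control groups (the text allows $R_T$ and $R_C$ to differ), and the function name and returned layout are hypothetical.

```python
import random


def assign_readers(reader_ids, seed=None):
    """Randomize readers to groups and to the order of the two image sets.

    Assumption for illustration: the two groups are of (near-)equal size.
    Within each group, half of the readers read set 1 first and half read
    set 2 first, so that the design is balanced with respect to film sets.
    """
    rng = random.Random(seed)
    ids = list(reader_ids)
    rng.shuffle(ids)  # random order, so slicing gives random groups
    half = len(ids) // 2
    groups = {"intervention": ids[:half], "control": ids[half:]}
    plan = {}
    for group, members in groups.items():
        cut = len(members) // 2
        for k, reader in enumerate(members):
            pre_set, post_set = (1, 2) if k < cut else (2, 1)
            plan[reader] = {"group": group, "pre_set": pre_set, "post_set": post_set}
    return plan


plan = assign_readers(range(8), seed=1)
for reader, row in sorted(plan.items()):
    print(reader, row)
```

With eight readers, each group contains four readers, two of whom read set 1 before the intervention phase and two of whom read set 2 first.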

Images for inclusion in the study need to be selected so that average
sensitivity and specificity at the baseline assessment are relatively low.
That is, improvements in accuracy should be possible with the sets of images
chosen. If, in the absence of intervention, all images from women with
disease were easily identified as such, the observed sensitivities pre- and
post-intervention would be close to 1 and a change in sensitivity would not
be identifiable regardless of the actual effect of intervention. Thus at
least some of the diseased images should be difficult but not impossible to
identify as being from women with disease. Analogous considerations apply to
specificity and the choice of non-diseased images included in the study.

4.  DATA  ANALYSIS 

Having described the basic elements of the design and the choice of primary
outcomes, we turn now to the strategy for data analysis. There are two
components to the analysis. The first concerns a comparison of post- versus
pre-intervention reading accuracy among the $R_T$ readers in the intervention
group. The second is the comparison of changes from pre- to post-intervention
between the intervention and control groups. We first consider the former
analysis, in part because it allows us to define notation most easily.

The purpose of this data analysis is to compare the overall sensitivity
pre-intervention with that post-intervention and to compare the overall
specificity pre-intervention with that post-intervention. If
$S_{r,\text{pre}}$ and $S_{r,\text{post}}$ denote the observed pre- and
post-intervention sensitivities for radiologist $r$, then the observed change
in the overall sensitivity, $\Delta_T(\text{sensitivity})$, is the average
change in sensitivities across radiologists in the intervention group:

$$\Delta_T(\text{sensitivity}) = \frac{1}{R_T} \sum_{r=1}^{R_T} \left( S_{r,\text{post}} - S_{r,\text{pre}} \right).$$

Similarly, the observed change in the overall specificity in the intervention
group is

$$\Delta_T(\text{specificity}) = \frac{1}{R_T} \sum_{r=1}^{R_T} \left( P_{r,\text{post}} - P_{r,\text{pre}} \right),$$

where $P_{r,\text{pre}}$ and $P_{r,\text{post}}$ denote the observed pre- and
post-intervention specificities for radiologist $r$. Variance estimators for
$\Delta_T(\text{sensitivity})$ and $\Delta_T(\text{specificity})$ are
provided in the appendix. Although $\Delta_T(\text{sensitivity})$ and
$\Delta_T(\text{specificity})$ are sample means of changes in sensitivities
and specificities, their variances are not given by the usual variance
formulae for sample means. Indeed such sample variances would overestimate
the variability. Rather, the correct variance estimators rely on
acknowledging that there are in essence two strata of radiologists in the
design, which are defined by the ordering of the two image sets which are
rated. The variances of $\Delta_T(\text{sensitivity})$ and
$\Delta_T(\text{specificity})$ are averages of stratum-specific variances, as
shown in Appendix A.

Sensitivity and specificity are highly correlated parameters. Radiologists
with high sensitivities tend to have low specificities. This will happen, for
example, if they have a low threshold for classifying images as diseased.
Similarly, changes in sensitivities and specificities induced by the
intervention may be highly correlated. In particular, if the intervention
simply changes the implicit threshold a radiologist has for classifying a
mammogram as diseased then the sensitivity and specificity will both be
changed but in opposite directions. Thus it is important to assess joint
effects of intervention on sensitivity and specificity and to account for
correlations between them in making inference. This can be accomplished by
employing a bivariate analysis approach which is a special case of
multivariate analysis, and for which there is a large statistical literature
[15]. Using this approach to test the hypothesis that the true average
sensitivity and specificity are unchanged by the intervention,
$H_0$: $\Delta_T(\text{sensitivity}) = \Delta_T(\text{specificity}) = 0$, a
chi-square test statistic is calculated. This statistic is a function of the
observed average changes, $\Delta_T(\text{sensitivity})$ and
$\Delta_T(\text{specificity})$, their variances and also their correlation.
An expression for the chi-squared statistic is provided in the Appendix.
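
To make the form of such a bivariate test concrete, the following Python sketch computes a quadratic-form (Wald-type) chi-square statistic with 2 degrees of freedom from per-radiologist changes. It is a simplified illustration, not the estimator from the Appendix. As a loudly stated assumption, it uses the ordinary sample variances and covariance of the changes, ignoring the two image-set-order strata, and the text above notes that such variances overestimate the variability. The function name is hypothetical.

```python
import math


def bivariate_change_test(d_sens, d_spec):
    """Test H0: mean change in sensitivity = mean change in specificity = 0.

    d_sens, d_spec: per-radiologist changes (post minus pre).
    Assumption: ordinary (unstratified) sample covariance of the changes.
    Returns (mean change sens, mean change spec, chi-square, p-value).
    """
    n = len(d_sens)
    m1 = sum(d_sens) / n
    m2 = sum(d_spec) / n
    # Elements of the estimated covariance matrix of the two means.
    a = sum((x - m1) ** 2 for x in d_sens) / (n - 1) / n
    d = sum((y - m2) ** 2 for y in d_spec) / (n - 1) / n
    b = sum((x - m1) * (y - m2) for x, y in zip(d_sens, d_spec)) / (n - 1) / n
    det = a * d - b * b
    # Quadratic form m' V^{-1} m for a 2 x 2 matrix V = [[a, b], [b, d]].
    chi2 = (d * m1 * m1 - 2.0 * b * m1 * m2 + a * m2 * m2) / det
    # Survival function of a chi-square distribution with 2 df is exp(-x/2).
    p_value = math.exp(-chi2 / 2.0)
    return m1, m2, chi2, p_value


d_sens = [0.10, 0.20, 0.15, 0.05]   # made-up changes in sensitivity
d_spec = [0.00, -0.05, 0.05, 0.00]  # made-up changes in specificity
print(bivariate_change_test(d_sens, d_spec))
```

The closed-form p-value is specific to 2 degrees of freedom. With so few readers the chi-square approximation is rough, so the toy numbers serve only to show the mechanics.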

In addition to simply testing the hypothesis of no intervention effect, it
will be important to provide a confidence region for the intervention effects
on sensitivity and specificity based on the observed data, that is, a range
of intervention effects,
$\{\Delta_T(\text{sensitivity}), \Delta_T(\text{specificity})\}$, that are
consistent with the observed data. Such a joint 95% confidence region is
defined formally as the set of values $(x, y)$ for which the hypothesis
$H_0$: $\{\Delta_T(\text{sensitivity}) = x, \Delta_T(\text{specificity}) = y\}$
is not rejected at the 5% significance level. This region is an ellipse,
centered at the observed intervention effect
$(\Delta_T(\text{sensitivity}), \Delta_T(\text{specificity}))$. We refer the
interested reader to the text [15] by Johnson and Wichern (1988, section 5.2)
for technical details regarding its calculation. Code for calculating such
regions has been written by Murdoch and Chow for the S-PLUS statistical
software package and can be obtained from the S-archive on the Statlib
computer site (http://lib.stat.cmu.edu). In a similar fashion a joint
confidence region for the overall average sensitivity and specificity pre- or
post-intervention can be calculated. It is calculated using the observed
radiologist-specific sensitivities and specificities pre- and
post-intervention, and requires only calculation of the means, variances and
correlations for these parameters. To illustrate these analyses, Fig. 3
displays joint confidence regions based on a simulated data set. In our
opinion these confidence regions provide a simple summary of the information
contained in study data regarding intervention effects on reading accuracy.
In the simulated data, the analyses show that sensitivity was increased by
the intervention whereas there is no evidence of change in specificity.
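
Because the region is defined by inverting the test, membership of a candidate point $(x, y)$ can be checked directly. The Python sketch below does so under a large-sample chi-square approximation with 2 degrees of freedom. This is an assumption of the sketch and may differ from the small-sample formulas in the cited text. The function name and the numerical inputs are hypothetical.

```python
import math


def in_joint_region(point, estimate, cov, level=0.95):
    """Is `point` inside the joint confidence ellipse centered at `estimate`?

    cov: 2 x 2 covariance matrix [[a, b], [b, d]] of the estimated pair
    (change in sensitivity, change in specificity).
    Assumption: large-sample chi-square (2 df) critical value.
    """
    dx = point[0] - estimate[0]
    dy = point[1] - estimate[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    q = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    # Quantile of chi-square with 2 df: -2 * ln(1 - level), about 5.99 at 95%.
    critical = -2.0 * math.log(1.0 - level)
    return q <= critical


estimate = (0.10, 0.00)                   # made-up observed changes
cov = [[0.001, 0.0], [0.0, 0.001]]        # made-up covariance of the estimates
print(in_joint_region((0.0, 0.0), estimate, cov))    # is "no effect" plausible?
print(in_joint_region((0.15, 0.0), estimate, cov))
```

In this made-up example the point (0, 0) lies outside the ellipse, which corresponds to rejecting the hypothesis of no intervention effect at the 5% level.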

So far we have considered the comparison of post- versus pre-intervention
reading accuracy within the intervention group. To attribute changes in
accuracy to the intervention it will be necessary to compare the changes in
the intervention group with those in the control group. Without the control
group comparison, observed changes might be attributed to other factors, such
as the increased reading practice or increased awareness of reader
fallibility induced by participation in the study. Thus, turning now to the
comparison of intervention and control groups, the main hypothesis to be
tested is that the changes in sensitivity and specificity in the intervention
group are the same as those in the control group. Using a subscript $T$ to
denote the intervention group and subscript $C$ to denote the control group,
the null hypothesis is
$H_0$: $\Delta_C(\text{sensitivity}) = \Delta_T(\text{sensitivity})$,
$\Delta_C(\text{specificity}) = \Delta_T(\text{specificity})$. A test
statistic that has a chi-square distribution with 2 degrees of freedom is
described in the appendix for testing this hypothesis. Joint confidence
regions for the differences in changes between the groups, namely
$\Delta_T(\text{sensitivity}) - \Delta_C(\text{sensitivity})$ and
$\Delta_T(\text{specificity}) - \Delta_C(\text{specificity})$, can be
calculated using methods analogous to those described earlier for the
pre-versus-post-intervention comparison.

FIGURE 3. Joint confidence regions for sensitivity and specificity both pre
and post intervention (upper panel) along with a joint confidence region
(lower panel) for the changes in these parameters. Data used in this
illustration were generated using computer simulation methods described in
sections 5 and 6. Points correspond to observed data for individual
radiologists.

5. METHODOLOGY FOR POWER CALCULATIONS

Power calculations for the reading study are somewhat complicated. They must
accommodate the facts that readers vary in their accuracy parameters of
sensitivity and specificity, that their sensitivities and specificities are
likely negatively correlated, that images vary in difficulty and that a
bivariate analysis approach will be employed. These factors together make
analytic expressions for sample size intractable. We instead take a computer
simulation approach to power calculations. The simulation approach to power
calculation is a general and standard method and indeed software has been
developed for certain types of applications [16]. The basic idea is to
repeatedly simulate data as it is expected or hoped to arise in the course of
the study, and determine how often the null hypothesis is rejected. By
definition the statistical power of the study is the proportion of simulated
studies in which the null hypothesis is rejected. One calculates the power in
this fashion using various sample sizes until a sample size is found that
provides adequate power. This indirect computer intensive approach to sample
size calculation is easily accomplished with modern computers.
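
The simulate-and-count idea can be sketched in a few lines of Python. The sketch is deliberately much simpler than the models developed in this paper, and all of its ingredients are assumptions made for illustration. It considers a single group of readers whose true sensitivities and specificities vary uniformly and independently around given means, with no image-level variability and no negative correlation between sensitivity and specificity. The test is an unstratified bivariate chi-square test. Function and parameter names are hypothetical.

```python
import math
import random


def chi2_stat(d1, d2):
    """Bivariate chi-square statistic for H0: both mean changes are zero."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    a = sum((x - m1) ** 2 for x in d1) / (n - 1) / n
    d = sum((y - m2) ** 2 for y in d2) / (n - 1) / n
    b = sum((x - m1) * (y - m2) for x, y in zip(d1, d2)) / (n - 1) / n
    det = a * d - b * b
    if det <= 0.0:  # degenerate simulated data set: treat as "not rejected"
        return 0.0
    return (d * m1 * m1 - 2.0 * b * m1 * m2 + a * m2 * m2) / det


def observed_rate(rng, n_images, prob):
    """Observed proportion of correct readings among n_images Bernoulli trials."""
    prob = min(max(prob, 0.0), 1.0)
    return sum(rng.random() < prob for _ in range(n_images)) / n_images


def simulated_power(n_readers, n_dis, n_non, sens, spec, d_sens, d_spec,
                    spread=0.1, n_sim=200, seed=0):
    """Proportion of simulated studies in which H0 is rejected at the 5% level."""
    rng = random.Random(seed)
    critical = -2.0 * math.log(0.05)  # chi-square (2 df) 95% quantile
    rejections = 0
    for _ in range(n_sim):
        ds, dp = [], []
        for _ in range(n_readers):
            s = sens + rng.uniform(-spread, spread)  # reader's true sensitivity
            p = spec + rng.uniform(-spread, spread)  # reader's true specificity
            ds.append(observed_rate(rng, n_dis, s + d_sens) - observed_rate(rng, n_dis, s))
            dp.append(observed_rate(rng, n_non, p + d_spec) - observed_rate(rng, n_non, p))
        if chi2_stat(ds, dp) > critical:
            rejections += 1
    return rejections / n_sim


print(simulated_power(20, 50, 50, sens=0.70, spec=0.80, d_sens=0.15, d_spec=0.0))
```

Rerunning `simulated_power` over a grid of `n_readers` and image counts until the estimated power is adequate mirrors the search over sample sizes described above. A real calculation would replace the data-generating step with the reader and image models of Section 5.1.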

5.1  Models  for  Pre-  and  Post-intervention  Accuracy 

To simulate study data we need to define precisely the mechanisms
giving rise to the data. We therefore need to make assumptions about
the reading accuracies before and after intervention. For this
purpose we suppose that before intervention a reader correctly
assesses a woman with tumor as being diseased with probability
$P^{D}_{ri}$. The probability $P^{D}_{ri}$ depends on the image,
denoted by i, and on the reader, denoted by r. The probabilities
$P^{D}_{ri}$ will presumably be higher if the tumor is clearly
visible in image i than if it is not. The probabilities will also be
higher if the radiologist is conservative and is inclined to
recommend biopsy for borderline cases. We let $S^{D}$ be the
sensitivity of the average radiologist to the average film from a
woman with tumor. The variability among films in terms of the
difficulty that readers have in assessing them is captured by
specifying a distribution for the sensitivities that the average
reader has in assessing the films. Here we assume that the average
reader's sensitivity to films varies uniformly in an interval
$(S^{D} - a^{D}, S^{D} + a^{D})$ across different films. Thus for the
average radiologist, easier films are read with sensitivity closer to
$S^{D} + a^{D}$ and more difficult films are read with sensitivity
closer to $S^{D} - a^{D}$. In a similar fashion, on the average film
from a diseased woman, the sensitivity of different readers is
assumed to vary uniformly in an interval
$(S^{D} - b^{D}, S^{D} + b^{D})$ across radiologists. Thus
radiologists with high sensitivity to the average film will have
sensitivity closer to $S^{D} + b^{D}$. In the appendix we detail a
logistic model with random effects (also called a mixed model) for
the probabilities $P^{D}_{ri}$ that gives rise to inter-image and
inter-reader variability as postulated here. It is assumed that on
the logistic scale there are no interactions between reader and
image specific effects on the sensitivity.

Observe that for the purposes of simulating data, by specifying
$S^{D}$ and $a^{D}$ we can now generate a random image effect by
choosing a random number in $(S^{D} - a^{D}, S^{D} + a^{D})$ that
corresponds to the sensitivity an average radiologist has for
detecting it. Similarly, having specified $S^{D}$ and $b^{D}$ we are
in a position to generate a random reader effect by choosing a
random number in $(S^{D} - b^{D}, S^{D} + b^{D})$ that corresponds to
his or her sensitivity to the average film. The logistic model
displayed in the appendix then yields the probability $P^{D}_{ri}$
that that reader has of correctly assessing that image as diseased.
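
As a rough sketch of this step, the Python fragment below draws an image effect and a reader effect on the logistic scale and combines them into a single reading probability, following the additive logistic form given in the appendix. The function names are hypothetical, and the draws for the reader here are independent; the negative correlation between a reader's sensitivity and specificity is not modeled in this fragment.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_effect(center, half_width, rng):
    """Draw U uniformly in (center - half_width, center + half_width)
    and return the corresponding effect on the logistic scale,
    logit(U) - logit(center)."""
    u = rng.uniform(center - half_width, center + half_width)
    return logit(u) - logit(center)


def reading_probability(center, image_effect, reader_effect):
    """Probability that this reader assesses this image correctly,
    with no reader-by-image interaction on the logistic scale."""
    return expit(logit(center) + image_effect + reader_effect)


if __name__ == "__main__":
    rng = random.Random(0)
    image_effect = draw_effect(0.70, 0.20, rng)   # average sensitivity 0.70, half-width 0.20
    reader_effect = draw_effect(0.70, 0.20, rng)
    print(reading_probability(0.70, image_effect, reader_effect))
```

With both effects set to zero the function returns the average accuracy itself, and with only one effect non-zero it returns the uniform draw that produced that effect.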

Analogous considerations apply to the determination of randomly
generated specificities which vary across radiologists and across
images from women without disease. Values for parameters
$F^{\bar{D}}$, $b^{\bar{D}}$ and $a^{\bar{D}}$ need to be specified
in order to define the data generating process. Here, $F^{\bar{D}}$
is the probability that the average radiologist will correctly
assess the average non-diseased image as such, radiologists vary
uniformly in $(F^{\bar{D}} - b^{\bar{D}}, F^{\bar{D}} + b^{\bar{D}})$
in their specificities to the average non-diseased film, and images
from women without disease vary uniformly in
$(F^{\bar{D}} - a^{\bar{D}}, F^{\bar{D}} + a^{\bar{D}})$ in the
probabilities of the average reader correctly classifying them. The
sensitivities and specificities from single radiologists should be
correlated. In the Appendix we describe how negative correlation
between sensitivities and specificities within radiologists can be
built into the data simulation mechanism.

In summary, for each study radiologist we simulate his/her
sensitivity and specificity to the average diseased and non-diseased
films, respectively, by randomly sampling correlated numbers from
$(S^{D} - b^{D}, S^{D} + b^{D})$ and
$(F^{\bar{D}} - b^{\bar{D}}, F^{\bar{D}} + b^{\bar{D}})$,
respectively. For each study film we determine the sensitivity or
specificity that an average radiologist has for it by randomly
sampling a number from $(S^{D} - a^{D}, S^{D} + a^{D})$ or
$(F^{\bar{D}} - a^{\bar{D}}, F^{\bar{D}} + a^{\bar{D}})$. Finally,
for each combination of film i and radiologist r, we can calculate
$P^{D}_{ri}$ or $P^{\bar{D}}_{ri}$, which is the probability that the
radiologist will assess that image correctly.

The $P^{D}_{ri}$ and $P^{\bar{D}}_{ri}$ pertain to probabilities
before intervention in the treatment and control groups. One also
needs to specify treatment effects in order that corresponding
probabilities after intervention can be calculated. We postulate
that after intervention the quantities $S^{D}$ and $F^{\bar{D}}$ are
changed to new values but that the variations among readers and
among images remain the same. In the Appendix we define in a
mathematically precise way a logistic model that incorporates such
intervention effects.

5.2  Simulated  Study  Data  Generation

Having specified statistical models for pre- and post-intervention
rating probabilities that incorporate variation among radiologists
and among images, we now turn to the simulation of study data in
accordance with the study design that we proposed in section 3. The
first step is to generate images and image sets. This entails
generating M diseased images (i.e., M image-specific parameters, one
for each image), generating M non-diseased images, and finally from
the 2M films choosing M at random without replacement to form film
set 1. The remaining M films constitute film set 2. The next step is
to generate $R_T$ intervention readers and $R_C$ control readers and
assign them film sets. That is, for each of $R_T + R_C$ readers we
generate pairs of pre- and post-intervention sensitivities and
specificities to average diseased and non-diseased films according
to the models described in section 5.1. Of the total $R_T + R_C$
readers, $R_T$ are assigned at random to the intervention group and
the remaining $R_C$ to the control group. Finally, film set
orderings are assigned to the readers, with half of the intervention
readers selected at random being assigned set 1 first and the other
half assigned set 2 first. Similarly, $R_C/2$ control readers are
assigned set 1 followed by set 2 and the other $R_C/2$ readers are
assigned film sets in the opposite order.
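
The bookkeeping in this step can be illustrated with a short Python sketch. It is a simplified stand-in for the actual simulation program: films are represented only by a disease label and an index, readers by integers, and the function names are hypothetical.

```python
import random


def make_film_sets(m, rng):
    """Create M diseased and M non-diseased films, then split the 2M films
    at random (without replacement) into film set 1 and film set 2."""
    films = [("diseased", k) for k in range(m)]
    films += [("non-diseased", k) for k in range(m)]
    rng.shuffle(films)
    return films[:m], films[m:]


def assign_readers(n_intervention, n_control, rng):
    """Randomly allocate readers to groups, then give a random half of each
    group the order (set 1, set 2) and the other half (set 2, set 1)."""
    readers = list(range(n_intervention + n_control))
    rng.shuffle(readers)
    groups = {
        "intervention": readers[:n_intervention],
        "control": readers[n_intervention:],
    }
    plan = {}
    for group, members in groups.items():
        half = len(members) // 2
        for position, reader in enumerate(members):
            # members are already in random order, so the first half is a random half
            order = (1, 2) if position < half else (2, 1)
            plan[reader] = (group, order)
    return plan
```

Because the split into film sets is purely random, the number of diseased films in each set varies from one simulated study to the next, as it would under the design described above.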

The final step in generating data for a simulated study is to
actually generate the readings for each reader and image
combination. That is, for each reader and for each of the M films in
his/her pre-intervention set, a binary random variable is generated
which is his/her assessment of whether or not that image shows
disease, using the probability $P^{D}_{ri,\mathrm{pre}}$ if the image
is diseased and $1 - P^{\bar{D}}_{ri,\mathrm{pre}}$ if the image is
not diseased. Similarly, for each of the M films in his/her
post-intervention set a similar binary random variable is generated
using $P^{D}_{ri,\mathrm{post}}$ or
$1 - P^{\bar{D}}_{ri,\mathrm{post}}$, noting that the pre- and
post-probabilities differ by different amounts for intervention
versus control radiologists.
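
A single reading can be generated as a Bernoulli draw, as in the following Python sketch. The function names are hypothetical; `prob_correct` stands for the probability that the reader assesses the film correctly, so the probability of a positive call is that value for a diseased film and one minus that value for a non-diseased film.

```python
import random


def simulate_call(is_diseased, prob_correct, rng):
    """Return True if the reader calls the film positive (shows disease)."""
    prob_positive = prob_correct if is_diseased else 1.0 - prob_correct
    return rng.random() < prob_positive


def observed_sensitivity(calls_on_diseased_films):
    """Proportion of diseased films that were called positive."""
    return sum(calls_on_diseased_films) / len(calls_on_diseased_films)
```

Repeating `simulate_call` over every film in a reader's pre- and post-intervention sets yields the observed sensitivities and specificities from which the test statistics are computed.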

Having  generated  the  simulated  study  data  the  test  statis¬ 
tics  of  interest  can  now  be  calculated.  Data  are  simulated 
(first  the  probabilities,  then  the  ratings)  and  results  calcu¬ 
lated  under  the  same  assumptions  and  study  design  many 
times,  with  1000  or  5000  simulated  datasets  being  typical 
numbers  used  for  power  calculations.  The  proportion  of  sim¬ 
ulated  studies  in  which  the  null  hypothesis  is  rejected  is 
the  calculated  study  power  for  that  design  and  under  those 
assumptions. 

6.  POWER  CALCULATIONS:  RESULTS  FOR  THE  MQIP  STUDY

To  fix  ideas,  we  now  illustrate  the  computer  simulation 
method  for  power  calculations  in  the  MQIP  study.  This  il¬ 
lustration  also  identifies  some  sources  of  data  to  guide  as¬ 
sumptions  for  power  calculations. 

We need to choose assumed parameters for the baseline sensitivities
and specificities, for the variations among radiologists and among
images and for intervention effects of interest. We assume that the
median sensitivity pre-intervention, $S^{D}$, in our study will be
in the range of 0.70 to 0.80. This accords with previous studies
that found median sensitivities of 0.70 and 0.80 [3,4]. Median
pre-intervention specificity will also be assumed to lie in the
range of 0.70 to 0.80. Beam et al. [4] found a median specificity of
0.94 for mammograms from women with normal mammograms and a median
specificity of 0.60 for mammograms from women with benign disease.
Elmore et al. [3] found a median specificity of 0.94. In contrast to
these studies, we will inform the radiologists of the average
prevalence, which is higher than that expected in a practical
screening setting. Because of this and the fact that the films in
our study will be somewhat difficult, we anticipate an initial
specificity lower than observed in those studies. The variation
amongst radiologists in sensitivities and specificities will be
assumed such that $b^{D} = 0.20$ and $b^{\bar{D}} = 0.20$, which is
in agreement with the range of approximately 40% in sensitivities
(and specificities) among radiologists observed in Beam's study. We
could find no data on inter-image variability to suggest appropriate
values for $a^{D}$ and $a^{\bar{D}}$. We assume that they are of the
same order of magnitude as the inter-rater variability parameters,
$a^{D} = a^{\bar{D}} = 0.20$. With regard to intervention effects of
interest, we consider that changes of 10 percentage points in either
sensitivity or specificity are of interest. However, we calculated
power for a variety of intervention effects.

TABLE 1. Power to detect a 10% increase in sensitivity and no effect
on specificity in the intervention group

                                                          Power
Readers     Films     Pre-           Pre-           Within         Comparison
per group   per set   intervention   intervention   intervention   with control
(R_T)       (M)       sensitivity    specificity    group          group

20          30        0.70           0.70           0.70           0.38
20          30        0.70           0.80           0.66           0.34
20          30        0.80           0.70           0.79           0.45
20          30        0.80           0.80           0.77           0.44
20          45        0.70           0.70           0.81           0.48
20          45        0.70           0.80           0.82           0.53
20          45        0.80           0.70           0.91           0.61
20          45        0.80           0.80           0.92           0.64
30          30        0.70           0.70           0.81           0.48
30          30        0.70           0.80           0.83           0.52
30          30        0.80           0.70           0.93           0.60
30          30        0.80           0.80           0.91           0.61
30          45        0.70           0.70           0.94           0.66
30          45        0.70           0.80           0.95           0.66
30          45        0.80           0.70           0.99           0.80
30          45        0.80           0.80           0.99           0.79
40          30        0.70           0.70           0.92           0.61
40          30        0.70           0.80           0.94           0.60
40          30        0.80           0.70           0.97           0.73
40          30        0.80           0.80           0.98           0.75
40          45        0.70           0.70           0.98           0.79
40          45        0.70           0.80           0.99           0.80
40          45        0.80           0.70           0.99           0.88
40          45        0.80           0.80           0.99           0.89

All tests are two sided and are tested at a significance level of 0.05.

Practical  considerations  concerning  time  and  cost  dictate 
the  range  of  sample  sizes  that  are  feasible  and  therefore,  for 
which  power  calculations  are  performed.  We  anticipate  that 
no  more  than  approximately  80  radiologists  are  available 
for  the  reading  study  in  the  rural  communities  in  which  our 
mammography  quality  improvement  study  is  being  con¬ 
ducted.  To  maximize  power,  equal  numbers  of  radiologists 
are  assigned  to  control  and  intervention  groups.  Therefore 
the  number  of  radiologists  per  group  to  be  considered  for 
power  calculation  purposes  will  be  in  the  range  of  20—40. 
Experience  suggests  that  readers  can  comfortably  read  no 
more  than  45  films  per  session.  We  therefore  calculated 
power  for  experiments  in  which  the  number  of  films  per  set, 
M,  was  either  30  or  45. 

Estimates of power based on computer simulations are shown in
Table 1. Though results are shown only for intervention effects on
sensitivity with no effect on specificity, because of the symmetry
inherent in the design, the same power calculations hold for a 10
percentage point change in specificity with no change in the
sensitivity. Observe that the power is far larger for the within
intervention group assessment of change than for the between group
comparison of change. This is to be expected since the variability
involved in comparing two random changes is greater than the
variability involved in comparing a single change with the null
hypothesis of no change. We also observe from Table 1 that the power
is less when the baseline sensitivity is 0.70 than when it is 0.80.
This is due to the relatively larger binomial variance for the lower
baseline rate. To be conservative we focus on this lower rate.
Interestingly, the baseline specificity had little impact on the
power to detect an intervention effect on the sensitivity.
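
The binomial variance argument can be made concrete with a small worked calculation. It is only a rough illustration: it treats the pre- and post-intervention proportions as independent binomial proportions based on $n$ readings each and ignores the reader and film effects that the simulation accounts for.

```latex
\begin{align*}
p(1-p)\big|_{p=0.70} &= 0.21, \qquad
p(1-p)\big|_{p=0.80} = 0.16, \qquad
p(1-p)\big|_{p=0.90} = 0.09, \\[4pt]
\operatorname{Var}(\hat p_{\mathrm{post}} - \hat p_{\mathrm{pre}})
  &\approx \frac{0.21 + 0.16}{n} = \frac{0.37}{n}
  \quad \text{for a change from } 0.70 \text{ to } 0.80, \\[4pt]
\operatorname{Var}(\hat p_{\mathrm{post}} - \hat p_{\mathrm{pre}})
  &\approx \frac{0.16 + 0.09}{n} = \frac{0.25}{n}
  \quad \text{for a change from } 0.80 \text{ to } 0.90 .
\end{align*}
```

Under these simplifying assumptions the standard error of the estimated change is larger by a factor of about $\sqrt{0.37/0.25} \approx 1.22$ at the lower baseline, which is in line with the lower power seen for a baseline sensitivity of 0.70.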

The  target  power  for  our  study  design  is  90%,  which 
allows  a  10%  chance  of  an  inconclusive  result  when  the 
intervention  increases  sensitivity  from  0.70  to  0.80.  For  the 
within  intervention  group  comparison  this  cannot  be 
achieved  with  20  readers,  but  it  can  be  achieved  with  30 
readers  if  45  images  are  included  in  each  image  set.  The 
between  group  comparison,  however,  has  a  power  of  only 
66%  in  this  case.  Even  with  use  of  our  maximum  resources, 
i.e.,  40  readers  per  group  and  45  images  per  reading  set,  the 
power  is  only  80%.  This  allows  for  a  20%  chance  of  an 
inconclusive  result  even  when  there  is  a  clinically  impor¬ 
tant  intervention  effect  on  diagnostic  accuracy. 

For the MQIP study we chose not to include a control group in the
reading study component, but instead to focus the study on the
within group comparison. The power calculations were an important
contribution to this decision but other considerations also played a
role. Radiologists would have little motivation to participate in
the control arm whereas they would receive continuing medical
education (CME) credit for participation in the intervention arm.
The possibility that those in the control arm would learn from the
baseline assessment was also a concern, and thus we were concerned
that it might not even be feasible to construct a true control
group. Finally, it was felt that if we found a definite positive
change in the intervention group, then this would provide sufficient
motivation to proceed with more comprehensive controlled studies in
the future. Thus we chose to study only the intervention effects in
the intervention group and to use sample sizes of 30 radiologists,
each reading sets of mammograms from 45 women before and after
intervention.

TABLE 2. Study power to detect various configurations of changes in
the intervention group using a study design with 30 readers and 45
films per set

Pre-intervention
sensitivity        Delta_T(sens)   Delta_T(spec)   Power

0.60               +0.10           0.00            0.90
0.70               +0.10           0.00            0.95
0.80               +0.10           0.00            0.98
0.60               +0.05           0.00            0.35
0.70               +0.05           0.00            0.39
0.80               +0.05           0.00            0.50
0.60               +0.05           +0.05           0.66
0.70               +0.05           +0.05           0.68
0.80               +0.05           +0.05           0.71

The pre-intervention specificity is assumed to be 0.70 in all cases.
The intervention induced change in sensitivity is denoted
Delta_T(sens) and in specificity is denoted Delta_T(spec).

The  simulation  program  allowed  us  the  flexibility  to  ex¬ 
plore  the  performance  of  this  study  design  in  a  variety  of 
settings  other  than  that  assumed  for  the  primary  sample  size 
calculation.  First  we  calculated  the  probability  of  rejecting 
the  null  hypothesis  for  settings  where  there  was  no  inter¬ 
vention  effect.  Recall  that  inference  for  the  test  statistic  is 
based  on  a  chi-square  statistic  and  is  theoretically  valid  with 
large  samples.  However,  this  study  entails  relatively  small 
samples.  We  used  the  simulations  to  check  the  adequacy  of 
the  large  sample  theory  in  our  study.  To  do  this  we  gener¬ 
ated  data  under  the  null  hypothesis.  The  rejection  probabil¬ 
ity  was  approximately  0.06  in  the  settings  we  studied,  indi¬ 
cating  that  the  true  significance  level  of  the  test  is  slightly 
higher  than  the  target  of  0.05  but  adequate  for  our  purposes. 

We next explored the power of this study design and sample sizes to
detect an array of intervention effects. Results are shown in
Table 2. Although the study has adequate power to detect a change in
sensitivity (or specificity) of 0.10 even when the pre-intervention
sensitivity is as low as 0.60, it has little chance of detecting a
smaller change of 0.05. On the other hand, if small changes of the
order of 0.05 occur in both the average sensitivity and the average
specificity, there is a good chance that the simultaneous effects
will be detected.

7.  DISCUSSION 

Diagnostic  imaging  technology  is  already  a  basic  component 
of medical care and continues to develop at a rapid pace.
It is clearly important to assess the accuracy with which
readers  can  diagnose  disease  using  such  technologies,  to 
evaluate  the  effects  of  training  strategies  and  to  compare 
methods.  Implications  for  public  health  can  be  enormous. 
Unfortunately,  statistical  methodology  for  evaluating  and 
comparing  imaging  methods  has  not  received  much  atten¬ 
tion  by  biostatisticians  and  epidemiologists  involved  in  pub¬ 
lic  health  research.  Rather  the  literature  is  concentrated  in 
radiology  research  journals,  has  generally  focused  on  small 
scale  studies  involving  only  a  few  readers  and  has  ignored 
clinical  implications  associated  with  different  diagnostic 
categories. We believe that it is time to bring the discussion
about  study  design  and  analysis  for  evaluating  imaging  tech¬ 
nology  to  the  broader  community  of  epidemiologists  and 
statisticians  involved  in  public  health.  This  is  particularly 
important  as  interest  increases  in  the  accuracies  and  costs 
of  these  imaging  methods.  By  presenting  our  thoughts  on 
the  design  and  analysis  of  a  study  to  evaluate  an  educational 
intervention  on  the  interpretation  of  mammograms,  we 
hope  to  stimulate  such  discussion. 

The choice of primary outcome measure is the most basic element of
any study design. We chose to consider the sensitivity and
specificity as the basis for evaluating intervention effects. This
conflicts with the view of the initial statistical reviewers of our
study design, who were of the opinion that ROC analysis was the only
appropriate and indeed the state-of-the-art basis for evaluating an
intervention effect. We argue that in mammography, where specific
clinical actions are associated with diagnostic rating categories,
sensitivity and specificity provide a more clinically relevant and
conceptually straightforward basis for comparison than does ROC
analysis. Moreover, this approach allows us to evaluate effects on
false positive as well as true positive rates. In contrast, ROC
analysis does not quantify the false positive rate directly but in a
sense only uses it to standardize the true positive rate. We do not
dismiss ROC analysis entirely but rather we regard the analysis of
the specific rating categories as being of secondary importance and
focus the design on sensitivity and specificity. Thus the MQIP study
was designed to ensure adequate power to detect changes in the most
clinically relevant quantities.

We also needed to decide upon the analysis techniques for making
statistical inference about sensitivity and specificity. We propose
to simultaneously estimate sensitivity and specificity using
multivariate methods. Sensitivity and specificity as we have defined
them are average sensitivities and average specificities of
radiologists in our study. They can also be interpreted as marginal
or population average quantities, in the sense of being the
probability that a diseased (or non-diseased) image will be
correctly interpreted as such in the study. The distinction between
the population average and average radiologist-specific
interpretations has to do with whether one considers the accuracy
parameters to be based on data pooled across radiologists
(population average) or to be based on calculation of the accuracy
parameter for each radiologist and then averaging the results. In
our study these quantities coincide because all radiologists are
expected to read the same number of films. In studies where this is
not the case, the distinction should be considered and a decision
should be made regarding which of the two entities is most relevant.

The  approach  we  propose  for  statistical  inference  is  rela¬ 
tively  straightforward,  being  based  on  methods  for  inference 
about  sample  means.  Confidence  intervals  are  based  on  the 
variance-covariance  matrix  of  the  estimated  (sensitivity, 
specificity)  parameters  or  their  changes  amongst  radiolo¬ 
gists.  Possible  non-normality  of  the  average  estimates  may 
be  an  issue  in  our  study,  though  for  the  settings  considered 
in  the  power  calculation  this  did  not  appear  to  be  the  case. 
An  alternative  approach  to  inference  which  might  be  more 
robust  would  follow  the  marginal  regression  modeling  ap¬ 
proach  described  by  Leisenring,  Pepe,  and  Longton  [17]. 
One could formulate logistic regression models for the population
average sensitivity and 1-specificity as

$$\operatorname{logit}\{\mathrm{Prob}[\text{screen positive} \mid \text{image diseased}]\} = \gamma_0 + \gamma_1 b$$

$$\operatorname{logit}\{\mathrm{Prob}[\text{screen positive} \mid \text{image non-diseased}]\} = \eta_0 + \eta_1 b$$

where the logit function is $\operatorname{logit}\{x\} = \ln\{x/(1 - x)\}$
and b is 0 if the image was read before the intervention and 1 if it
was read after the intervention. The changes in the true and false
positive rates are now quantified in the (log) odds ratio parameters
$\gamma_1$ and $\eta_1$, respectively, and joint confidence
intervals can be calculated. By adding an interaction term between b
and I, where I is an indicator of the radiologist being in the
control or intervention groups:

$$\operatorname{logit}\{\mathrm{Prob}[\text{screen positive} \mid \text{image diseased}]\} = \gamma_0 + \gamma_1 b + \gamma_2 bI$$

$$\operatorname{logit}\{\mathrm{Prob}[\text{screen positive} \mid \text{image non-diseased}]\} = \eta_0 + \eta_1 b + \eta_2 bI$$

a comparison of the changes in the intervention and control groups
can be made by testing if the parameters $\gamma_2$ or $\eta_2$ are
0. Though this logistic regression modeling approach may provide
more robust confidence intervals, we felt that the simpler approach
described earlier was adequate for power calculations.
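
To make the interpretation of the pre/post coefficient concrete, the Python sketch below uses the fact that for a logistic model with a single binary covariate the maximum likelihood estimates can be read directly from the pooled 2 x 2 table: the intercept is the logit of the pre-intervention proportion and the slope is the difference in logits. The function name and the counts in the usage line are hypothetical, and the sketch gives point estimates only; valid standard errors would need to account for the correlation among readings from the same reader and the same film, which is not shown here.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def fit_pre_post(pos_pre, n_pre, pos_post, n_post):
    """Point estimates for logit{P(screen positive)} = intercept + slope * b,
    where b = 0 before and b = 1 after the intervention.

    Returns (intercept, slope, odds_ratio), with odds_ratio = exp(slope).
    """
    p_pre = pos_pre / n_pre
    p_post = pos_post / n_post
    intercept = logit(p_pre)
    slope = logit(p_post) - logit(p_pre)
    return intercept, slope, math.exp(slope)


if __name__ == "__main__":
    # Hypothetical counts: 70 of 100 diseased films called positive before,
    # 80 of 100 after the intervention.
    print(fit_pre_post(70, 100, 80, 100))
```

For the hypothetical counts shown, the odds of a positive call on a diseased film rise from 0.70/0.30 to 0.80/0.20, an odds ratio of about 1.71.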


The  prototype  reading  study  we  have  described  concerns 
evaluating  the  effect  of  an  intervention  on  the  change  in 
accuracy  parameters.  We  note,  however,  that  most  of  our 
discussion  is  also  relevant  to  the  comparison  of  accuracies 
associated  with  different  imaging  modalities.  Suppose  for 
example,  that  there  are  two  sets  of  women  (denoted  by  set 
1  and  set  2)  from  which  images  have  been  made  using  two 
modalities.  A  natural  study  design  to  compare  the  modal¬ 
ities  would  entail  readers  assigned  to  read  one  set  of  films 
produced  with  one  modality  and  the  other  set  of  films  pro¬ 
duced  with  the  other  modality.  Using  the  notation  1(A)  to 
denote  set  1  produced  with  modality  A  and  similarly  for  the 
other  combination,  readers  read  either  {1(A)  and  2(B)}  or 
{2(A)  and  1(B)}.  Considering  that  the  ordering  may  also 
influence  accuracy  parameters,  this  yields  four  groups  of 
readings,  {1(A),  2(B)},  {2(B),  1(A)},  {2(A),  1(B)}  and 
{1(B),  2(A)}.  A  balanced  cross-over  design  would  assign 
radiologists  randomly  to  these  four  reading  assignments. 
The difference in the sensitivity and specificity between modality A
and B can be calculated by simply pooling all relevant readings for
modality A and similarly for modality B. Inference for the
difference follows in the same fashion as that described for the
change induced by intervention in the intervention group of our
study, except that now there are 4 rather than 2 strata of
radiologists defined by the image reading set assignments.

Power  calculations  for  reading  studies  are  not  straightfor¬ 
ward  due  in  part  to  correlations  induced  by  images  and  read¬ 
ers.  That  is,  for  each  image  there  are  multiple  readings. 
Moreover,  each  reader  provides  multiple  readings  and  radi¬ 
ologist  specific  sensitivities  and  specificities  are  correlated. 
We  propose  simple  analyses  for  dealing  with  these  factors 
but  power  calculations  required  a  computer  simulation  ap¬ 
proach.  We  found  the  process  of  developing  the  computer 
simulation  study  to  be  a  useful  exercise.  It  compels  one  to 
think  through  the  processes  generating  study  data.  It  also 
allows  one  to  experiment  with  the  assumptions  and  design 
easily.  For  example,  we  considered  designs  that  included  a 
larger  number  of  film  sets  to  be  read  in  the  study  and  found 
that  the  study  power  was  decreased  slightly  due  to  the  extra 
variation  introduced.  Computer  simulations  also  allow  one 
to  check  how  test  statistics  perform  under  the  null  hypothe¬ 
sis  with  sample  sizes  proposed  in  the  study.  Hence  one  can 
check  if  inference  based  on  large  sample  theory  is  valid  in 
the  setting  where  it  is  to  be  applied.  We  suggest  that  simula¬ 
tion  studies  are  a  useful  approach  to  power  calculations  in 
any  setting,  though  given  the  complexities  in  radiology 
reading  studies,  the  case  for  the  technique  in  this  setting  is 
particularly  strong. 
APPENDIX A

1.  VARIANCE  ESTIMATORS  FOR  CHANGE  IN 
OVERALL  SENSITIVITY  AND  SPECIFICITY 

The change in the overall sensitivity defined in Section 4 can be
written formally as

$$\Delta_T(\text{sensitivity}) = \frac{1}{R_T} \sum_{r \in (\text{order} = 1,2)} \left(S_{r,\mathrm{post}} - S_{r,\mathrm{pre}}\right) + \frac{1}{R_T} \sum_{r \in (\text{order} = 2,1)} \left(S_{r,\mathrm{post}} - S_{r,\mathrm{pre}}\right)$$

where $S_{r,\mathrm{pre}}$ is the observed sensitivity for
radiologist r with his or her pre-intervention film set and
$S_{r,\mathrm{post}}$ is the corresponding quantity
post-intervention. Observe that the order of film sets essentially
defines two strata in this setting and the notation (order = 1,2)
(or [order = 2,1]) used to denote the stratum in the summation
indicates that it includes only radiologists assigned sets in the
order set 1 first and set 2 second (or set 2 first and set 1
second). The variance of $\Delta_T(\text{sensitivity})$ is estimated
using the variance of a stratified sample mean,
$V = 0.5\,(V_{(1,2)} + V_{(2,1)})/R_T$, where $V_{(1,2)}$ is the
sample variance of the quantities
$(S_{r,\mathrm{post}} - S_{r,\mathrm{pre}})$ in the stratum
(order = 1,2), and $V_{(2,1)}$ is the analogous quantity in the other
stratum. The ratio $\Delta_T(\text{sensitivity})/\sqrt{V}$ can be
compared with a standard normal distribution to test for a change in
the sensitivity which is statistically significantly different
from 0.
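
The calculation above can be sketched in Python as follows. The function name is hypothetical, the inputs are the per-radiologist changes in sensitivity (post minus pre) in each film-set-order stratum, and the 0.5 weights assume that the two strata contain equal numbers of radiologists, as in the design described in the text.

```python
import math
import statistics


def stratified_change_test(changes_order_12, changes_order_21):
    """Overall change, its stratified variance estimate, the z ratio,
    and a two-sided p-value from the standard normal distribution."""
    r_t = len(changes_order_12) + len(changes_order_21)
    delta = (sum(changes_order_12) + sum(changes_order_21)) / r_t
    v = 0.5 * (statistics.variance(changes_order_12)
               + statistics.variance(changes_order_21)) / r_t
    z = delta / math.sqrt(v)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return delta, v, z, p_value
```

`statistics.variance` uses the usual n - 1 denominator, which matches the sample variance referred to above.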


2.  Chi-Square  Test  Statistics  for  Bivariate  Analyses 

To simultaneously test the null hypotheses that both the sensitivity
and specificity are unchanged in the intervention group,
$H_0$: $\Delta_T(\text{sensitivity}) = 0 = \Delta_T(\text{specificity})$,
the following test statistic can be used

$$\left[\Delta_T(\text{sensitivity}) \;\; \Delta_T(\text{specificity})\right] \, \Sigma_T^{-1} \begin{bmatrix} \Delta_T(\text{sensitivity}) \\ \Delta_T(\text{specificity}) \end{bmatrix}$$

where the square bracket notation is used to denote vectors and
$\Sigma_T^{-1}$ is the inverse of a square matrix $\Sigma_T$. This
matrix $\Sigma_T$ is a variance-covariance matrix for the
two-dimensional statistic
$[\Delta_T(\text{sensitivity}) \;\; \Delta_T(\text{specificity})]$,
and is the analogue of the variance V defined above in relation to
the one-dimensional quantity $\Delta_T(\text{sensitivity})$.
Formally we write


Mr  -  1) 


where $\hat{\Sigma}_T^{(1,2)}$ is the sample variance-covariance
matrix for the quantities
$\{S_{r,\mathrm{post}} - S_{r,\mathrm{pre}},\; F_{r,\mathrm{post}} - F_{r,\mathrm{pre}}\}$
in the stratum (order = 1,2), and $\hat{\Sigma}_T^{(2,1)}$ is the
analogous quantity calculated for the other stratum. The test
statistic is compared with a standard chi-square distribution with 2
degrees of freedom in order to test the null hypothesis concerning
changes in sensitivities and specificities.
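
A minimal Python sketch of this bivariate test is given below. It assumes the caller supplies the estimated 2 x 2 variance-covariance matrix of the two changes (its construction from the stratum matrices is not reproduced here), and the function name is hypothetical. The p-value uses the fact that the upper tail probability of a chi-square distribution with 2 degrees of freedom at x is exp(-x/2).

```python
import math


def bivariate_chi_square(delta_sens, delta_spec, cov):
    """Quadratic form d' * inverse(cov) * d for d = (delta_sens, delta_spec),
    with its p-value from the chi-square distribution with 2 degrees of freedom.

    cov is a 2 x 2 variance-covariance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    inv = ((d / det, -b / det), (-c / det, a / det))
    stat = (delta_sens * (inv[0][0] * delta_sens + inv[0][1] * delta_spec)
            + delta_spec * (inv[1][0] * delta_sens + inv[1][1] * delta_spec))
    p_value = math.exp(-stat / 2.0)
    return stat, p_value
```

With uncorrelated changes the statistic reduces to the sum of the two squared z ratios, which is a convenient check on an implementation.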

Consider now the component of the data analysis concerning the
comparison of changes between intervention and control groups. Using
a subscript C to denote the control group in analogy with our use of
the subscript T to denote the intervention group, we define the
statistics $\Delta_C(\text{sensitivity})$,
$\Delta_C(\text{specificity})$ and $\Sigma_C$. The estimated
differences between the groups in changes of sensitivities and
specificities can be written as
$\Delta_T(\text{sensitivity}) - \Delta_C(\text{sensitivity})$ and
$\Delta_T(\text{specificity}) - \Delta_C(\text{specificity})$,
respectively. The hypothesis that the changes are the same for
intervention and control groups can be tested by comparing the
statistic

$$\left[\Delta_T(\text{sens}) - \Delta_C(\text{sens}) \;\; \Delta_T(\text{spec}) - \Delta_C(\text{spec})\right] \left(\Sigma_T + \Sigma_C\right)^{-1} \begin{bmatrix} \Delta_T(\text{sens}) - \Delta_C(\text{sens}) \\ \Delta_T(\text{spec}) - \Delta_C(\text{spec}) \end{bmatrix}$$

with the quantiles of a chi-square distribution with 2 degrees of
freedom, where we use the abbreviations "sens" and "spec" to denote
"sensitivity" and "specificity" in the above expressions.

3.  Mixed  Models  for  Reading  Accuracies

Section  5  outlines  a  statistical  model  for  sensitivity  and  specificity 
parameters  which  vary  with  reader  and  image.  Here  we  present  a 
more  formal  and  precise  definition  of  this  model.  For  radiologist 
r  on  diseased  film  i,  we  write  the  chance  of  correctly  identifying 
it as diseased pre-intervention using a logistic model as

$$P_{ri}^{D} = \exp\{\mu^{D} + \gamma_i^{D} + \beta_r^{D}\} / (1 + \exp\{\mu^{D} + \gamma_i^{D} + \beta_r^{D}\})$$

where $\gamma_i^{D}$ and $\beta_r^{D}$ are random variables specific to this film and
radiologist, respectively. For the average radiologist $\beta_r^{D} = 0$, and
for the average film $\gamma_i^{D} = 0$. Thus for the average radiologist on
the average film the sensitivity is $S^{D} = \exp\{\mu^{D}\}/(1 + \exp\{\mu^{D}\})$.
The films vary in difficulty in the sense that the average radiologist
has a lower sensitivity on some films and a higher sensitivity on
others. Mathematically this translates into allowing $\gamma_i^{D}$ to vary.
We choose it as a random variable so that the average radiologist’s
sensitivity to different films varies uniformly in an interval
$(S^{D} - a^{D}, S^{D} + a^{D})$. Technically this is achieved by letting
$\gamma_i^{D} = \ln\{U_i^{D}/(1 - U_i^{D})\} - \mu^{D}$ where $U_i^{D}$ is a random variable with a
uniform distribution in $(S^{D} - a^{D}, S^{D} + a^{D})$. The radiologists also vary
amongst themselves in their sensitivities to the same film and this
inter-rater variation translates into allowing $\beta_r^{D}$ to vary. We
simulated data so that on the average diseased film (i.e., $\gamma_i^{D} = 0$) the
sensitivities of radiologists varied uniformly in $(S^{D} - b^{D}, S^{D} + b^{D})$.
Again, technically we let $\beta_r^{D} = \ln\{U_r^{D}/(1 - U_r^{D})\} - \mu^{D}$ where
$U_r^{D}$ is a random variable with a uniform distribution on the interval
$(S^{D} - b^{D}, S^{D} + b^{D})$.
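
The construction of the film and radiologist effects for sensitivity can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the argument names (`s_d`, `a_d`, `b_d` for the average sensitivity and the two interval half-widths) and the use of Python's `random` module are hypothetical choices, not the authors' implementation. It relies on the fact that the location parameter equals the logit of the average sensitivity.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def draw_effect(rng, center, half_width):
    """Draw ln{U/(1-U)} - mu, with U uniform on center +/- half_width."""
    if not (0.0 < center - half_width and center + half_width < 1.0):
        raise ValueError("interval must lie strictly inside (0, 1)")
    u = rng.uniform(center - half_width, center + half_width)
    return logit(u) - logit(center)  # logit(center) plays the role of mu

def simulate_sensitivities(s_d, a_d, b_d, n_films, n_readers, seed=0):
    """Return a readers-by-films table of pre-intervention sensitivities."""
    rng = random.Random(seed)
    mu = logit(s_d)
    gamma = [draw_effect(rng, s_d, a_d) for _ in range(n_films)]   # film effects
    beta = [draw_effect(rng, s_d, b_d) for _ in range(n_readers)]  # reader effects
    # Logistic model: P_ri = expit(mu + gamma_i + beta_r).
    return [[expit(mu + g + b) for g in gamma] for b in beta]
```

With the film half-width set to zero every film is an "average" film, so each radiologist's sensitivity reduces to his or her own uniform draw, which gives a quick check on the construction.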

Turning now to specificities, we write the specificity for
radiologist r on non-diseased film j pre-intervention as

$$P_{rj}^{\bar{D}} = \exp\{\mu^{\bar{D}} + \gamma_j^{\bar{D}} + \beta_r^{\bar{D}}\} / (1 + \exp\{\mu^{\bar{D}} + \gamma_j^{\bar{D}} + \beta_r^{\bar{D}}\})$$

where in analogy with the above notation for diseased films, the
average radiologist on the average film has specificity $F^{\bar{D}} =
\exp\{\mu^{\bar{D}}\}/(1 + \exp\{\mu^{\bar{D}}\})$ and parameters $\gamma_j^{\bar{D}}$ and $\beta_r^{\bar{D}}$ indicate
variation in the specificity with film and radiologist. As argued in
section 5, data should be generated so that the $\beta_r^{D}$ and $\beta_r^{\bar{D}}$ are
negatively correlated. We incorporated this into the simulation by first
generating the sensitivity radiologist-specific random effect
parameter, $\beta_r^{D}$ (i.e., his/her sensitivity to the average film), which is
based on the random variable $U_r^{D}$, and then letting the
corresponding random variable for the specificity random effect be defined
as


$$U_r^{\bar{D}} = F^{\bar{D}} - (U_r^{D} - S^{D})$$


Thus if the radiologist’s sensitivity is x% above the average
radiologist’s sensitivity to the average film, $S^{D}$, his/her specificity
will be x% below the average specificity to the average film.
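
The following sketch shows one way to generate radiologist effects for sensitivity and specificity that move in opposite directions. It assumes the mirror relation is read in absolute terms: a radiologist whose sensitivity draw lies some amount above the average sensitivity is given a specificity draw the same amount below the average specificity. That reading, the function name and the argument names (`s_d` for average sensitivity, `f_n` for average specificity, `b_d` for the half-width) are assumptions made for illustration only.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def paired_reader_effects(s_d, f_n, b_d, n_readers, seed=0):
    """Return (sensitivity effect, specificity effect) for each radiologist."""
    if not (0.0 < s_d - b_d and s_d + b_d < 1.0):
        raise ValueError("sensitivity interval must lie inside (0, 1)")
    if not (0.0 < f_n - b_d and f_n + b_d < 1.0):
        raise ValueError("mirrored specificity interval must lie inside (0, 1)")
    rng = random.Random(seed)
    effects = []
    for _ in range(n_readers):
        u_sens = rng.uniform(s_d - b_d, s_d + b_d)
        # Assumed mirror step: above-average sensitivity, below-average specificity.
        u_spec = f_n - (u_sens - s_d)
        beta_sens = logit(u_sens) - logit(s_d)
        beta_spec = logit(u_spec) - logit(f_n)
        effects.append((beta_sens, beta_spec))
    return effects
```

Because each specificity effect is a decreasing function of the corresponding sensitivity effect, the two effects always have opposite signs, which produces the negative correlation the text calls for.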

Our model postulates that after intervention the quantities $S^{D}$
and $F^{\bar{D}}$ are changed to new values but that the radiologist and
image-specific parameters remain unchanged. Thus, suppose that
after intervention the sensitivity of the average radiologist to the
average film is $\exp\{\mu^{D} + \alpha^{D}\}/(1 + \exp\{\mu^{D} + \alpha^{D}\})$. Then the
chances that radiologist r will correctly classify film i pre- and
post-intervention are

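$$P_{ri}^{D} = \exp\{\mu^{D} + \gamma_i^{D} + \beta_r^{D}\} / (1 + \exp\{\mu^{D} + \gamma_i^{D} + \beta_r^{D}\})$$

and

$$P_{ri}^{D,\mathrm{post}} = \exp\{\mu^{D} + \alpha^{D} + \gamma_i^{D} + \beta_r^{D}\} / (1 + \exp\{\mu^{D} + \alpha^{D} + \gamma_i^{D} + \beta_r^{D}\}),$$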

respectively. Similarly the postulated change in $F^{\bar{D}}$ specifies a
parameter $\alpha^{\bar{D}}$ (analogous to $\alpha^{D}$) which facilitates calculation of
post-intervention specificities. Having chosen values for the various
parameters ($\mu^{D}$, $\alpha^{D}$, $a^{D}$, $b^{D}$) and ($\mu^{\bar{D}}$, $\alpha^{\bar{D}}$, $a^{\bar{D}}$, $b^{\bar{D}}$), this completes
the first step of the simulation power calculation method, namely
specification of accuracy parameter distributions pre-intervention
and intervention effects.
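
To make the role of the intervention parameter concrete, the sketch below computes pre- and post-intervention probabilities of a correct reading from the logistic model and draws a simulated reading. The function names and the argument name `alpha` for the intervention shift on the logit scale are hypothetical; the film and radiologist effects are held fixed across the two periods, as the model postulates.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def pre_post_probability(mu, alpha, gamma_i, beta_r):
    """Probabilities of a correct reading before and after intervention.

    The intervention shifts the logit by alpha; the film effect gamma_i
    and radiologist effect beta_r are the same in both periods.
    """
    pre = expit(mu + gamma_i + beta_r)
    post = expit(mu + alpha + gamma_i + beta_r)
    return pre, post

def simulate_reading(rng, prob):
    """Return 1 for a correct reading and 0 otherwise."""
    return 1 if rng.random() < prob else 0

# Average radiologist on the average film, with a shift of ln(3) on the
# logit scale: the probability rises from one half to three quarters.
pre, post = pre_post_probability(0.0, math.log(3.0), 0.0, 0.0)
rng = random.Random(0)
readings = [simulate_reading(rng, post) for _ in range(10)]
```

The same two functions apply to specificity once the non-diseased-film parameters are substituted, which is all the first simulation step requires.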


